The buyer's market approach: our newest testing methods

Astute readers have probably already guessed what we are changing in this latest server CPU evaluation, but we will let one of the professionals among our readers explain it in his own words. His excellent feedback on improving our evaluations was posted at it.anandtech.com:

"Increase your time horizon. Knowing the performance of the latest and greatest may be important, but most shops are sitting on stuff that's 2-3 years old. An important data point is how the new compares to the old. (Or to answer management's question: what does the additional money get us vs. what we have now? Why should we spend money to upgrade?)"

To help answer this question, we will include a 3 year old system in this review: a dual Dempsey system, which was introduced in the spring of 2006. The Dempsey or Xeon 5080 server might even be "too young", but as it is based on the "Blackford" chipset, it allows us to use the same FB-DIMMs as can be found in new Harpertown (Xeon 54xx) systems. That is important as most of our tests require quite large amounts of memory.

A 3.73GHz Xeon 5080 "Dempsey" performed roughly on par with a 2.3GHz Xeon 51xx "Woodcrest" and a 2.6GHz dual-core Opteron in SAP and TPC-C. That gives you a few rough points of comparison, and rough is good enough here: we are using this old reference system to find out whether the newest CPUs are 2, 5, or 10 times faster, so a difference of a few percent does not matter.

In our Shanghai review, we radically changed our benchmark methodology. Instead of throwing at our servers every software box we happen to have on the shelf and know well, we decided that the "buyers" should dictate our benchmark mix. Basically, every software type that is really important should have at least one, and preferably two, representatives in the benchmark suite. In the table below, you will find an overview of the software types servers are bought for and the benchmarks you can find in this review. If you want more detail about each of these software packages, please refer to this page.

Benchmark Overview

Server software market  | Importance | Benchmarks used
------------------------|------------|------------------------------------------------------------
ERP, OLTP               | 10-14%     | SAP SD 2-tier (industry standard benchmark); Oracle Charbench (freely available benchmark); Dell DVD Store (open source benchmark tool)
Reporting, OLAP         | 10-17%     | MS SQL Server (real world + vApus)
Collaborative           | 14-18%     | MS Exchange LoadGen (Microsoft's own load generator for Exchange)
Software development    | 7%         | Not yet
E-mail, DC, file/print  | 32-37%     | MS Exchange LoadGen
Web                     | 10-14%     | MCS eFMS (real world + vApus)
HPC                     | 4-6%       | LS-DYNA, LINPACK (industry standard)
Other                   | 2%?        | 3ds Max (our own benchmark)
Virtualization          | 33-50%     | VMmark (industry standard); vApus test (in a later review)

The combination of an older reference system and real world benchmarks that closely match the software that servers are bought for should offer you a new and better way of comparing server CPUs. We complement our own benchmarks with the more reliable industry standard benchmarks (SAP, VMmark) to reach this goal.

A look inside the lab

We had two weeks to test Nehalem, and tests like the Exchange and OLTP tests take more than half a day to set up and perform - not to mention that it sometimes takes months to master them. Understanding how to properly configure a mail server like Exchange is completely different from configuring a database server. Our testing is now clearly beyond what one person can know and perform alone. I would like to thank my colleagues at the Sizing Servers Lab for helping to perform all this complicated testing: Tijl Deneut, Liz Van Dijk, Thomas Hofkens, Joeri Solie, and Hannes Fostie. The Sizing Servers Lab is part of Howest, which is part of Ghent University in Belgium. The most popular parts of our research are published here at it.anandtech.com.

Liz proudly showing that she was first to get the MS SQL Server testing done. Notice the missing parts: the Shanghai at 2.9GHz (still in the air) and the Linux Oracle OLTP test that we are still trying to get right.

The SQL Server and website testing was performed with vApus, our "Virtual Application Unique Stress" testing tool. This tool took our team, led by Dieter Vandroemme, two years of research and programming, but it was well worth it. It allows us to stress test real world databases, websites, and other applications with the actual logs those applications produce. vApus does not simply replay the logs: it intelligently chooses the actions that real users would perform, based on the statistical distributions observed in those logs.
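vApus itself is an in-house, proprietary tool, but the core idea of driving a stress test from the statistical distribution of logged actions rather than a literal replay can be sketched in a few lines. The function and variable names below are our own illustration, not vApus code:

```python
import random
from collections import Counter

def build_action_distribution(log_actions):
    """Turn a raw list of logged actions into (action, weight) pairs
    based on how often each action was observed."""
    counts = Counter(log_actions)
    total = sum(counts.values())
    return [(action, count / total) for action, count in counts.items()]

def simulate_user(distribution, n_actions, rng=random):
    """Generate a synthetic user session by sampling n_actions from the
    observed distribution, instead of replaying the log verbatim."""
    actions, weights = zip(*distribution)
    return rng.choices(actions, weights=weights, k=n_actions)

# Example: a tiny web-shop log where "search" dominates
log = ["login", "search", "search", "view_item", "search", "checkout"]
dist = build_action_distribution(log)
session = simulate_user(dist, n_actions=10)
```

A real tool layers much more on top (think times, session ordering, per-action parameters), but this is the essential difference between a replay and a statistical simulation.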


You can see vApus in action in the picture above. Note that the errors are time-outs. For each number of concurrent users, we see the number of responses and the average response time, and it is possible to dig deeper and examine the response time of each individual action. An action consists of one or more queries (for databases) or a number of URLs (for example, those needed to open a single webpage).
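To make the reported numbers concrete, this is roughly the kind of aggregation behind such a results screen: group samples per concurrency level, count anything slower than the time-out threshold as an error, and average the rest. Again, this is our own illustrative sketch, with an assumed time-out value, not the tool's actual code:

```python
from collections import defaultdict

def summarize(results, timeout=2.0):
    """Aggregate (concurrent_users, response_time) samples per concurrency level.

    A response time of None, or one above `timeout` seconds, is counted as
    an error (a time-out); everything else contributes to the average.
    """
    buckets = defaultdict(list)   # successful response times per user count
    errors = defaultdict(int)     # time-outs per user count
    for users, rt in results:
        if rt is None or rt > timeout:
            errors[users] += 1
        else:
            buckets[users].append(rt)
    return {
        users: {
            "responses": len(buckets[users]),
            "errors": errors[users],
            "avg_response_time": (sum(buckets[users]) / len(buckets[users])
                                  if buckets[users] else None),
        }
        for users in set(buckets) | set(errors)
    }
```

Drilling down to individual actions, as vApus allows, is just the same bookkeeping keyed by (users, action) instead of by users alone.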

The reason we feel it is important to use real world applications from lesser-known companies is that these kinds of benchmarks are impossible to optimize for. Manufacturers sometimes include special optimizations in their JVMs, compilers, and other developer tools with the sole purpose of gaining a few points in well-known benchmarks. Our benchmarks allow us to perform a real world sanity check.

Comments

  • gwolfman - Tuesday, March 31, 2009 - link

    Why was this article pulled yesterday after it first posted?
  • JohanAnandtech - Tuesday, March 31, 2009 - link

    Because the NDA date was noon in the pacific zone and not CET. We were slightly too early...
  • yasbane - Tuesday, March 31, 2009 - link

    Hi Johan,

    Any chance of some more comprehensive Linux benchmarks? Haven't seen any on IT Anandtech for a while.

    cheers
  • JohanAnandtech - Tuesday, March 31, 2009 - link

    Yes, we are working on that. Our first Oracle testing is finished on AMD's platform, but we are still working on the rest.

    Mind you, all our articles so far have included Linux benchmarking. All mysql testing for example, Stream, Specjbb and Linpack.
  • Exar3342 - Monday, March 30, 2009 - link

    Thanks for the extremely informative and interesting review Johan. I am definitely looking forward to more server reviews; are the 4-way CPUs out later this year? That will be interesting as well.
  • Exar3342 - Monday, March 30, 2009 - link

    Forgot to mention that I was surprised HT had the impact that it did in some of the benches. It made some huge differences in certain applications, and slightly hindered performance in others. Overall, I can see why Intel wanted to bring back SMT for the Nehalem architecture.
  • duploxxx - Monday, March 30, 2009 - link

    awesome performance, but would like to see how the intel 5510-20-30 fare against the amd 2378-80-82 after all that is the same price range.

    It was the same with woodcrest and conroe launch, everybody saw huge performance lead but then only bought the very slow versions.... then the question is what is still the best value performance/price/power.

    Istanbul better come faster for amd, how it looks now with decent 45nm power consumption it will be able to bring some battle to high-end 55xx versions.
  • eryco - Tuesday, April 14, 2009 - link

    Very informative article... I would also be interested in seeing how any of the midrange 5520/30 Xeons compare to the 2382/84 Opterons. Especially now that some vendors are giving discounts on the AMD-based servers, the premium for a server with X5550/60/70s is even bigger. It would be interesting to see how the performance scales for the Nehalem Xeons, and how it compares to Shanghai Opterons in the same price range. We're looking to acquire some new servers and we can afford 2P systems with 2384s, but on the Intel side we can only go as far as E5530s. Unfortunately there's no performance data for Xeons in the midrange anywhere online so we can make a comparison.
  • haplo602 - Monday, March 30, 2009 - link

    I only skimmed the graphs, but how about some consistency ? some of the graphs feature only dual core opterons, some have a mix of dual and quad core ... pricing chart also features only dual core opterons ...

    looking just at the graphs, I cannot make any conclusion ...
  • TA152H - Monday, March 30, 2009 - link

    Part of the problem with the 54xx CPUs is not the CPUs themselves, but the FB-DIMMS. Part of the big improvement for the Nehalem in the server world is because Intel sodomized their 54xx platform, for reasons that escape most people, with the FB-DIMMs. But, it's really not mentioned except with regards to power. If the IMC (which is not an AMD innovation by the way, it's been done many times before they did it, even on the x86 by NexGen, a company they later bought) is so important, then surely the FB-DIMMs are. They both are related to the same issue - memory latency.

    It's not really important though, since that's what you'd get if you bought the Intel 54xx; it's more of an academic complaint. But, I'd like to see the Nehalem tested with dual channel memory, which is a real issue. The reason being, it has lower latency while only using two channels, and for some benchmarks, certainly not all or even the majority, you might see better performance by using two (or maybe it never happens). If you're running a specific application that runs better using dual channel, it would be good to know.

    Overall, though, a very good article. The first thing I mention is a nitpick, the second may not even matter if three channel performance is always better.
