SQL Server 2008 R2 Enterprise

We have been using the database of the Flemish/Dutch web 2.0 website Nieuws.be for some time. Of the load on that database, 99% consists of selects, and about 5% of those are issued through stored procedures. You can find a more detailed description here.

We have improved our testing methodology and updated the SQL Server version, so these results are no longer comparable with previously published ones. We used to report the highest throughput we could reach, but we found that this is not entirely fair: at peak throughput there was a very small (<1%) percentage of connection errors (client-side timeouts), and those timeouts could make the results vary by 5-10%. A better configured .NET data provider improved the situation. We adapted the .NET data provider to use the same timeout as the standard MS SQL Server timeout (60s), and we now meticulously scan our logs for errors and discard every result that contains any.
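Our harness is built on the .NET data provider, so the snippet below is not our actual code; it is only a minimal Python/pyodbc sketch of the policy described above: a 60-second client-side query timeout matching the server-side timeout, with any error flagged so the whole run can be discarded. The connection string, `run_queries` helper, and driver name are hypothetical.

```python
# Minimal sketch (not the actual .NET harness): align the client-side query
# timeout with the 60 s server-side timeout, and flag errors so the run
# can be discarded as a whole.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=dbserver;DATABASE=nieuws;Trusted_Connection=yes;"
)  # hypothetical connection details

def run_queries(queries):
    """Run all queries; return True only if the run had zero errors."""
    errors = 0
    cnxn = pyodbc.connect(CONN_STR, timeout=60)  # login timeout (seconds)
    cnxn.timeout = 60                            # per-query timeout (seconds)
    cursor = cnxn.cursor()
    for sql in queries:
        try:
            cursor.execute(sql)
            cursor.fetchall()
        except pyodbc.Error as exc:
            errors += 1
            print(f"query failed: {exc}")
    cnxn.close()
    # policy from the text: any error invalidates the result
    return errors == 0
```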

Since turbo modes are disabled in the "Balanced" Power management policy and we want to evaluate both power and performance, we tested with both the "Balanced" and "High Performance" power management policies.

[Graph: MS SQL Server 2008 maximum throughput]

We have reported this before: when it comes to pure OLAP MS SQL Server throughput, the Opteron "Magny-Cours" is unbeatable. Notice that both the Xeon and the new Opteron 6276 can hardly leverage Turbo (i.e. "High Performance" mode), even though both Intel and AMD advertise the potential to run at higher clock speeds even when all cores are active. While this integer-intensive workload comes nowhere close to consuming what a Linpack run would require, the TDP headroom is no longer there to enable a clock speed boost.

Looking at the results, the Opteron 6276 disappoints somewhat as it is not capable of outperforming its older brother. However, it performs relatively close to the Xeon and is thus far from a dud.

At maximum CPU load, the response times paint a picture very similar to the throughput numbers:

[Graph: MS SQL Server 2008 response times]

When we look at the response times, the Opteron 6174's leadership is confirmed and emphasized. The fact that a 16-cluster 2.3GHz Opteron offers 30% more throughput and response times twice as fast as its 8-cluster 3GHz brother is a clear testimony to the excellent scaling capabilities of SQL Server.

Since performance/watt is an extremely important metric, we follow up with a power measurement:

[Graph: MS SQL Server 2008 power consumption]

There's no doubt about it: the Opteron 6174 is the performance/watt champion in this particular task.

This is the classical way to evaluate server performance, but should you base your purchase on these numbers alone? Frankly, no. The 100% full-load evaluation is incomplete and does not reflect how databases are used in the real world. It only shows what your servers can spit out when running flat out, a situation most people either try to avoid or never see, since many other bottlenecks (I/O, lock contention) kick in before you reach 100% CPU utilization. It is the best method for evaluating HPC machines, but a short-sighted one for almost any server application (web, database, etc.).

In other words, server benchmarks at 100% load are just one data point; we should test at lower concurrencies as well. That is why we simulated 40, 80, 100, 125, 200, 300, 400, 600, and 800 users with our vApus stress testing client. Each user issues a query after 900 to 1100 ms of "think time" (so one second on average). At 600 and 800 users our servers reach their maximum throughput, but how do they handle "low" and "midrange" workloads? Let us see what happened when we tested with everyday, "normal" loads.
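The actual load is generated by the vApus client, so the following is only a rough sketch of the think-time model described above, not our tooling. In this hypothetical Python script (the names user_session, run_load, and issue_query are made up for illustration), each simulated user fires a query, sleeps 900-1100 ms, and repeats, giving roughly one query per second per user at each concurrency level.

```python
# Illustrative sketch of the think-time model (the real tests use vApus).
# Each simulated user waits 900-1100 ms between queries, so on average
# ~1 query/second per user.
import random
import threading
import time

def user_session(user_id, duration_s, issue_query):
    deadline = time.time() + duration_s
    while time.time() < deadline:
        issue_query(user_id)                  # fire one query
        time.sleep(random.uniform(0.9, 1.1))  # 900-1100 ms think time

def run_load(concurrency, duration_s, issue_query):
    threads = [
        threading.Thread(target=user_session, args=(i, duration_s, issue_query))
        for i in range(concurrency)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    # The concurrency steps from the text: 40 up to 800 simulated users.
    for users in (40, 80, 100, 125, 200, 300, 400, 600, 800):
        run_load(users, duration_s=30,
                 issue_query=lambda uid: None)  # plug in a real query here
```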

Comments

  • sonofgodfrey - Thursday, February 9, 2012 - link

    Have you explicitly tested one socket vs. two sockets? We've found an immense increase in contention once a cache-line has to be shared between sockets on some systems.
  • JohanAnandtech - Friday, February 10, 2012 - link

    That is one suggestion I will try out next week. Thanks!
  • Klimax - Thursday, February 9, 2012 - link

    Hello.

    Nice tests.

    However, I would like to see MySQL tested on Windows Server 2008 R2.
    It would be an interesting comparison.

    (Especially due to http://channel9.msdn.com/shows/Going+Deep/Arun-Kis... )
  • Klimax - Thursday, February 9, 2012 - link

    Title of post is wrong... (I have deleted second thing and forgot to fix title)
  • Scali - Thursday, February 9, 2012 - link

    Unless I'm mistaken, the Xeon 5650 is a 1.17B transistor chip, whereas the Interlagos 6276 is a 2.4B transistor chip.
    In that light, doesn't that make Intel's SMT implementation a lot better than CMT?
    I mean, yes, CMT may give more of a performance boost when you increase the thread count. But considering the fact that AMD spends more than twice the number of transistors on the chip... well, that's pretty obvious.
    AMD might as well just have used conventional cores.
    The true strength of SMT is not so much that it improves performance in multithreaded scenarios, but that it does so at virtually no extra cost in terms of transistors (and with little or no impact on single-threaded performance either).
  • JohanAnandtech - Friday, February 10, 2012 - link

    Interlagos is a 1.2 billion transistor chip (maybe 1.3, but anyway). Most of those transistors are spent on the L3 cache: about 0.5 billion. Only 213 million transistors are in a module, and each module contains a 2 MB L2 cache, probably good for 120 million transistors. That leaves about 90 million transistors for the cores, and it has been stated that the second cluster added 12%. So that second cluster costs about 12 million transistors, or 48 million on the total 4-module die. That is less than 5% of the total transistor count, but you get a 30-90% performance boost!

    So for AMD, this was clearly a great choice.

    SMT is perfect for Intel, as the Intel architecture puts all instructions in one big ROB.

    For very low-IPC server workloads, I think the CMT approach gives better results. Unfortunately, AMD reduced some of the CMT benefits by keeping the data cache so small and the associativity of the I-cache so low.
  • Scali - Friday, February 10, 2012 - link

    Uhhh, I think you're wrong here... the 4-module Bulldozer is a 1.2B chip (Zambezi). But you tested the 8-module Interlagos (16 threads), which is TWO Zambezi dies in one package.
    Hence 2*1.2 = 2.4B transistors.
  • JohanAnandtech - Friday, February 10, 2012 - link

    OK, it is two chips of 1.2 billion each. That doesn't change anything about our analysis of CMT.
  • Scali - Friday, February 10, 2012 - link

    Not in the article, because you did not factor in transistor count (which is the flaw I tried to point out in the first place... comparing two chips, where one has twice the transistor count of the other, is quite the apples-to-oranges comparison. One would expect a chip with twice the transistor count to be considerably better in multithreaded scenarios, not 'catching up' to the smaller chip).

    But in your post above, I think it changes everything about your analysis. All your figures have to be multiplied by two.
    Which makes it a very poor comparison, not only to Intel, but also to AMD's own previous line of CPUs.
    The 6174 Magny Cours is actually beating Interlagos, with 'only' 12 threads, no kind of CMT/SMT, and 'only' 1.8B transistors.

    How does that make CMT look like a great choice for AMD?
  • slycer.tech - Friday, February 10, 2012 - link

    From what I read on the benchmark configuration page, Anand used 2x Intel Xeon X5650, so 2x 1.17B = 2.34B. I think that is comparable to the AMD CPU used in this test. Am I right?
