Benchmark Configuration

Since AMD sent us a 1U Supermicro server, we had to resort to testing with our 1U servers again. That is why we went back to the ASUS RS700 for the Xeon server.

Supermicro A+ server 1022G-URG (1U Chassis)

CPU: 2x AMD Opteron Interlagos 6276 (2.3GHz, 8 cores per CPU, 16 integer clusters)
     2x AMD Opteron Interlagos 6220 (3.0GHz, 4 cores per CPU, 8 integer clusters)
     2x AMD Opteron Magny-Cours 6174 (2.2GHz, 12 cores per CPU)
RAM: 64GB (8x8GB) DDR3-1600 Samsung M393B1K70DH0-CK0
Motherboard: SuperMicro H8DGU-F
Chipset: AMD SR5670 + SP5100
BIOS version: v2.81 (10/28/2011)
PSU: SuperMicro PWS-704P-1R 750W

The AMD CPUs have four memory channels per CPU. The new Interlagos Bulldozer CPU supports DDR3-1600, so our dual-CPU configuration uses eight DIMMs for maximum bandwidth and performance. We ran with one DIMM per channel.

Asus RS700-E6/RS4 1U Server

CPU: 2x Intel Xeon X5650 (2.66GHz, 6 cores/12 threads)
RAM: 48GB (12x4GB) Kingston DDR3-1333 FB372D3D4P13C9ED1
Motherboard: Asus Z8PS-D12-1U
Chipset: Intel 5520
BIOS version: 1102 (08/25/2011)
PSU: 770W Delta Electronics DPS-770AB

To speed up testing, we ran the Intel Xeon and AMD Opteron systems in parallel. As we didn't have more than eight 8GB DIMMs, we used our 4GB DDR3-1333 DIMMs for the Xeon server. The Xeon system only ends up with 48GB, but this is no disadvantage as our benchmark with the highest memory footprint (Nieuws.be/SQL Server 5 tiles) uses no more than 30GB of RAM.

We measured the difference between 12x4GB and 8x8GB of RAM and recalculated the power consumption for our power measurements (note that the differences were very small). There is no practical alternative as our Xeon has three memory channels and cannot be optimally configured with the same amount of RAM as our Opteron system (which has four channels).
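The channel arithmetic behind the two memory configurations can be sketched as follows. This is only an illustration: the `MemoryConfig` class and its field names are hypothetical, the two DIMMs per channel on the Xeon are inferred from 12 DIMMs spread over two CPUs with three channels each, and the bandwidth figure is the theoretical peak (transfer rate times an 8-byte channel width), assuming the DIMMs run at their rated speed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical helper describing a symmetric dual-socket DIMM layout."""
    sockets: int
    channels_per_socket: int
    dimms_per_channel: int
    dimm_size_gb: int
    transfer_rate_mts: int  # e.g. DDR3-1600 means 1600 megatransfers per second

    @property
    def channels(self) -> int:
        return self.sockets * self.channels_per_socket

    @property
    def dimm_count(self) -> int:
        return self.channels * self.dimms_per_channel

    @property
    def capacity_gb(self) -> int:
        return self.dimm_count * self.dimm_size_gb

    @property
    def peak_bandwidth_gbs(self) -> float:
        # Theoretical peak only: each DDR3 channel is 64 bits (8 bytes) wide.
        return self.channels * self.transfer_rate_mts * 8 / 1000


# Opteron: 2 CPUs x 4 channels x 1 DIMM per channel, 8GB DDR3-1600 DIMMs
opteron = MemoryConfig(2, 4, 1, 8, 1600)
# Xeon: 2 CPUs x 3 channels x 2 DIMMs per channel (inferred), 4GB DDR3-1333 DIMMs
xeon = MemoryConfig(2, 3, 2, 4, 1333)

for name, cfg in (("Opteron", opteron), ("Xeon", xeon)):
    print(f"{name}: {cfg.dimm_count} DIMMs, {cfg.capacity_gb}GB, "
          f"{cfg.peak_bandwidth_gbs:.1f} GB/s theoretical peak")

# 64GB does not divide evenly over the Xeon's six channels with identical DIMMs.
print("64GB remainder over Xeon channels:", opteron.capacity_gb % xeon.channels)
```

The non-zero remainder in the last line is the reason the Xeon cannot be populated symmetrically with the same 64GB as the Opteron system.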

We chose the Xeons based on AMD's positioning. The Xeon X5649 is priced at the same level as the Opteron 6276 but we didn't have the X5649 in the labs. As we suggested in our previous article, the Opteron 6276 should reach the performance of the X5650 to be attractive, so we tested with the X5650.

Common Storage System

Both servers used Intel 710 SSDs for storing the database.

Software configuration

All Windows testing was done on Windows Server 2008 R2 SP1. The Linux tests were done on Ubuntu 11.10 with Linux kernel 3.0.0-14 (SMP, x86_64).

Other

Both servers were fed by a standard European 230V (16 Amps max.) powerline. The room temperature was monitored and kept at 23°C by our Airwell CRACs. We used the Racktivity ES1008 Energy Switch PDU to measure power. Using a PDU for accurate power measurements might seem pretty insane, but this is not your average PDU. Measurement circuits of most PDUs assume that the incoming AC is a perfect sine wave, but it never is. The Racktivity PDU, however, measures true RMS current and voltage at a very high sample rate: up to 20,000 measurements per second for the complete PDU.
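To illustrate why the sine-wave assumption matters, here is a minimal sketch comparing a true RMS calculation with an "average-responding" estimate that rectifies the signal and scales it by the form factor of a pure sine. It does not model the Racktivity hardware: the function names, the 400 samples per cycle, and the 30% third-harmonic distortion are all assumptions chosen for illustration.

```python
import math

SAMPLES_PER_CYCLE = 400  # assumed value for this sketch, not a PDU specification


def true_rms(samples):
    """Root of the mean of the squared samples; valid for any waveform."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def average_responding_rms(samples):
    """Rectified mean scaled by pi / (2 * sqrt(2)), the form factor of a pure sine.

    This is only correct when the waveform really is a sine.
    """
    mean_abs = sum(abs(s) for s in samples) / len(samples)
    return mean_abs * math.pi / (2 * math.sqrt(2))


def sample_cycle(waveform, n=SAMPLES_PER_CYCLE):
    """Sample one full period of waveform(theta) at n evenly spaced points."""
    return [waveform(2 * math.pi * k / n) for k in range(n)]


pure_sine = sample_cycle(math.sin)
# Hypothetical distorted waveform: fundamental plus a 30% third harmonic.
distorted = sample_cycle(lambda t: math.sin(t) + 0.3 * math.sin(3 * t))

for label, wave in (("pure sine", pure_sine), ("distorted", distorted)):
    exact = true_rms(wave)
    estimate = average_responding_rms(wave)
    error_pct = 100 * (estimate - exact) / exact
    print(f"{label}: true RMS={exact:.4f}, sine-assumed={estimate:.4f}, "
          f"error={error_pct:+.2f}%")
```

On the pure sine the two methods agree; on the distorted waveform the sine-assumed estimate drifts by several percent, which is the kind of error a true RMS measurement avoids.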

46 Comments

  • Scali - Saturday, February 11, 2012

    "It also reduces throughput."

    No, it improves throughput, assuming we are talking about the improvement going from 1 physical core to 2 logical cores.
    Clearly two logical cores (on the same physical core) have less throughput than two physical cores, but that's obvious since you only have half the hardware.

    And that, together with the fact that Intel's SMT chips have far better single-threaded performance to begin with, results in very good performance per die area (you know, that thing that people used to praise AMD GPUs for).

    "Yes, it does, via the implementation of all that shared hardware on the chip."

    You can't say that, since there is no non-modular version of Bulldozer (just as there is no non-HT version of the Intel architectures).
    However, if you compare a 4-core HT architecture with a non-HT architecture, be that a Core2 Quad or a Phenom X4, Intel's transistor count is clearly in the same ballpark, so HT does not add much in terms of transistor count.

    With CMT we see little or no indication of reduced transistor count. AMD's 4-module chips are MUCH larger than regular 4-core chips have been. In fact, AMD's 4-module design is even larger than Intel's 6-core design with HT.

    "Two different approaches to the same idea."

    I disagree. SMT is a very different idea from CMT (which is a bogus marketing term invented by AMD anyway). CMT is more of a marketing excuse for not having proper SMT, and shows no merit in actual silicon.

    "but I don't think we can label one as inherently better than the other yet."

    Well clearly we disagree on that then.
    I think SMT is clearly inherently better than CMT. SMT has far more flexible sharing of resources than AMD's half-baked approach.
    And any theoretical disadvantages (fighting over resources and all that) can be put to bed with benchmarking such as in this review: the disadvantages may exist, but the net performance is unbeatable anyway. A midrange Xeon schools a CMT-based chip of twice the size.
  • Andexxx - Wednesday, February 15, 2012

    Well, there are a lot of factors affecting single-threaded performance in real life. So CMT indeed has its scaling advantages, as the tests suggested. At least most things should be constant when comparing CMT-on and CMT-off, whereas that is not the case when comparing SMT and CMT on different implementations. Lack of single-threaded performance is not a valid reason to blame CMT.

    If you want to *prove* that CMT is half-baked marketing crap while SMT is the only solution, what you need is an SMT-based AMD BD monolithic core or a CMT-based Intel monolithic module for comparison.

    As for the transistor count, well, that's their choice in making such a cache and uncore configuration. You can keep saying the 4-module chip is blahblahblah, but in some cases it beats a 4C8T Xeon chip. Transistor count is not a big matter from the customer's viewpoint, only from the producer's viewpoint. If you want to argue with GPU performance metrics, a GPU is a data-parallel processor with a bunch of logic units, while a CPU is a latency-sensitive girlfriend of caches. A large amount of cache can make your performance/mm^2 or performance/transistor look worse. So trade-offs on the amount of cache should have been made before they started to design the chip.
  • Scali - Wednesday, February 15, 2012

    Well, one of the reasons why AMD's current CPUs have such poor single-threaded performance is because they moved from 3 ALUs per thread to 2 ALUs per thread.
    This is part of the whole CMT design.
    So in that sense, CMT can be blamed for the poor single-threaded performance at least.
    And since single-threaded performance is so bad, it is only logical that scaling to more threads is relatively good.
    On a CPU with faster single-threaded performance, you run into IO limits sooner (memory, disk etc), so it is more difficult to maintain similar scaling with increased thread count.

    The strength of SMT is that Intel did not have to cut any ALUs when implementing HT. Pentium 4 Northwood with HT still had two double-pumped ALUs, like the non-HT Willamette that went before it.
    Likewise, Core i7 still has 3 ALUs, like Core2.
    Another strength of SMT is that even with one less ALU per 2 threads than CMT, it still reaches similar performance in multithreaded scenarios. CMT can not share these ALUs between threads, while SMT can.
    Conclusion: CMT is nonsense.
    For the full version, see: http://scalibq.wordpress.com/2012/02/14/the-myth-o...
  • slycer.tech - Monday, February 13, 2012

    If the Bulldozer arch is really that bad, how about this?
    http://www.marketwatch.com/story/amd-opterontm-620...
    Can someone prove this award is a big lie?
  • duploxxx - Tuesday, February 14, 2012

    read the article, the baseline they use for price/performance is based on spec results....lots of companies still use these results to decide on a platform.

    but then again, benchmarks don't always show the real world value, or are even hard to compare, since many have in house applications that don't scale or scale differently from the ones benchmarked in reviews. 90% of the datacenters don't even require more than any midrange cpu, anything above midrange is wasted money and both vendors provide more than adequate solutions to that. It's the human mind that is often blocking sanity. Investing that wasted money in other solutions often provides a better total performing solution.
  • anti_shill - Monday, April 2, 2012

    shill_detector by anti_shill on Monday, April 02, 2012
    Here's a more accurate reflection of Bulldozer/Interlagos performance, untainted by Intel ad bucks...

    http://www.phoronix.com/scan.php?page=article&...

    But if u really want to see what the true story is, have a look at AMD's stock price lately, and their server wins. They absolutely smoke Intel on virtualization, and anything that requires a lot of threads. It's not even close.
