Understanding the Performance Numbers

As Intel and AMD are adding more and more cores to their CPUs, we encounter two main challenges to keep these CPUs scaling. Cache coherency messages can add a lot of latency and absorb a lot of bandwidth, and at the same time all those cores require more and more bandwidth. So the memory subsystem plays an important role. We still use our older stream binary. This binary was compiled by Alf Birger Rustad using v2.4 of Pathscale's C-compiler. It is a multi-threaded, 64-bit Linux Stream binary. The following compiler switches were used:

-Ofast -lm -static -mp

We ran the stream benchmark on SUSE SLES 11. The stream benchmark produces 4 numbers: copy, scale, add, triad. Triad is the most relevant in our opinion, it is a mix of the other three.

Stream TRIAD on 64 bit linux - maximum threads

The new DDR3 memory controller gives the Opteron 6100 series wings. Compared to the Opteron 2435 which uses DDR-2 800, bandwidth has increased by 130%. Each core gets more bandwidth, which should help a lot of HPC applications. It is a pity of course that the 1.8 GHz Northbridge is limiting the memory subsystem. It would be interesting to see 8-core versions with higher clocked northbridges for the HPC market.

Also notice that the new Xeon 5600 handles DDR3-1333 a lot more efficiently. We measured 15% higher bandwidth from exactly the same DDR3-1333 DIMMs compared to the older Xeon 5570.  

The other important metric for the memory subsystem is latency. Most of our older latency benchmarks (such as the latency test of CPUID) are no longer valid. So we turned to the latency test of Sisoft Sandra 2010.

  Speed (GHz) L1 (Clocks) L2 (Clocks) L3 (Clocks) Memory (ns)
Intel Xeon X5670 2.93GHz 4 10 56 87
Intel Xeon X5570 2.80GHz 4 9 47 81
AMD Opteron 6174 2.20GHz 3 16 57 98
AMD Opteron 2435 2.60GHz 3 16 56 113

 

With Nehalem, Intel increased the latency of the L1 cache from 3 cycles to 4. The tradeoff was meant to allow for future scaling as the basic architecture evolves. The Xeons have the smallest (256 KB) but the fastest L2-cache. The L3-cache of the Xeon 5570 is the fastest, but the latency advantage has disappeared on the Xeon X5670 as the cache size increased from 8 to 12 MB.

Interesting is also the fact that the move from DDR2-800 to DDR3-1333 has also decreased the latency to the memory system by about 15%. There's nothing but good news for the 12-core Opteron here: more bandwith and lower latency access per core.

Benchmark Methods and Systems Rendering: Cinebench 11.5
POST A COMMENT

58 Comments

View All Comments

  • Accord99 - Monday, March 29, 2010 - link

    The X5670 is 6-core. Reply
  • JackPack - Tuesday, March 30, 2010 - link

    LOL. Based on price?

    Sorry, but you do realize that the majority of these 6-core SKUs will be sold to customers where the CPU represents a small fraction of the system cost?

    We're talking $40,000 to $60,000 for a chassis and four fully loaded blades. A couple hundred dollars difference for the processor means nothing. What's important is the performance and the RAS features.
    Reply
  • JohanAnandtech - Tuesday, March 30, 2010 - link

    Good post. Indeed, many enthusiast don't fully understand how it works in the IT world. Some parts of the market are very price sensitive and will look at a few hundreds of dollars more (like HPC, rendering, webhosting), as the price per server is low. A large part of the market won't care at all. If you are paying $30K for a software license, you are not going to notice a few hundred dollars on the CPUs. Reply
  • Sahrin - Tuesday, March 30, 2010 - link

    If that's true, then why did you benchmark the slower parts at all? If it only matters in HPC, then why test it in database? Why would the IDM's spend time and money binning CPU's?

    Responding with "Product differentiation and IDM/OEM price spreads" simply means that it *does* matter from a price perspetive.
    Reply
  • rbbot - Saturday, July 10, 2010 - link

    Because those of us with applications running on older machines need comparisons against older systems in order to determine whether it is worth migrating existing applications to a new platform. Personally, I'd like to see more comparisons to even older kit in the 2-3 year range that more people will be upgrading from. Reply
  • Calin - Monday, March 29, 2010 - link

    Some programs were licensed by physical processor chips, others were licensed by logical cores. Is this still correct, and if so, could you explain in based on the software used for benchmarking?
    Reply
  • AmdInside - Monday, March 29, 2010 - link

    Can we get any Photoshop benchmarks? Reply
  • JohanAnandtech - Monday, March 29, 2010 - link

    I have to check, but I doubt that besides a very exotic operation anything is going to scale beyond 4-8 cores. These CPUs are not made for Photoshop IMHO. Reply
  • AssBall - Tuesday, March 30, 2010 - link

    Not sure why you would be running photoshop on a high end server. Reply
  • Nockeln - Tuesday, March 30, 2010 - link

    I would recommend trying to apply some advanced filters on a 200+ GB file.

    Especially with the new higher megapixel cameras I could easilly see how some proffesionals would fork up the cash if this reduces the time they have to spend in front of the screen waiting on things to process.


    Reply

Log in

Don't have an account? Sign up now