Understanding the Performance Numbers

As Intel and AMD add more and more cores to their CPUs, two main challenges stand in the way of scaling: cache coherency messages can add a lot of latency and absorb a lot of bandwidth, and at the same time all those cores require more and more bandwidth. The memory subsystem therefore plays an important role. We still use our older Stream binary, which was compiled by Alf Birger Rustad using v2.4 of PathScale's C compiler. It is a multi-threaded, 64-bit Linux Stream binary. The following compiler switches were used:

-Ofast -lm -static -mp

We ran the Stream benchmark on SUSE SLES 11. It produces four numbers: copy, scale, add, and triad. Triad is the most relevant in our opinion, as it is a mix of the other three.
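To make the four numbers concrete, here is a minimal sketch of what each Stream kernel computes. It is written in plain Python purely for illustration; the binary used in this review is compiled C code running over large arrays, and the helper name `bandwidth_mb_s` is our own, not part of Stream. The byte counts assume 8-byte doubles, which is how Stream tallies the memory traffic of each kernel.

```python
# Illustrative pure-Python versions of the four Stream kernels.
# This only shows what each kernel computes; it is not a usable benchmark.

def copy(a):
    return [x for x in a]                       # c[i] = a[i]

def scale(c, q):
    return [q * x for x in c]                   # b[i] = q * c[i]

def add(a, b):
    return [x + y for x, y in zip(a, b)]        # c[i] = a[i] + b[i]

def triad(b, c, q):
    return [x + q * y for x, y in zip(b, c)]    # a[i] = b[i] + q * c[i]

# Bytes moved per array element, assuming 8-byte doubles:
# copy and scale touch two arrays, add and triad touch three.
BYTES_PER_ELEMENT = {"copy": 16, "scale": 16, "add": 24, "triad": 24}

def bandwidth_mb_s(kernel, n_elements, seconds):
    """Hypothetical helper: bandwidth in MB/s (10^6 bytes) for one timed pass."""
    return BYTES_PER_ELEMENT[kernel] * n_elements / seconds / 1e6
```

Triad combines the scaling of one array with the addition of another, which is why it reads like a mix of the other three kernels.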

Stream TRIAD on 64 bit linux - maximum threads

The new DDR3 memory controller gives the Opteron 6100 series wings. Compared to the Opteron 2435, which uses DDR2-800, bandwidth has increased by 130%. Each core gets more bandwidth, which should help a lot of HPC applications. It is a pity, of course, that the 1.8 GHz northbridge limits the memory subsystem. It would be interesting to see 8-core versions with higher-clocked northbridges for the HPC market.
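For context, the theoretical ceiling of a memory configuration can be estimated as transfer rate times bus width times channel count. The sketch below is only that estimate. The channel counts (four DDR3-1333 channels per Opteron 6174 socket, two DDR2-800 channels per Opteron 2435 socket) and the 8-byte channel width are assumptions that this section does not state, and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(transfers_mt_s, channels, bus_bytes=8):
    """Theoretical peak in GB/s: transfers per second * bytes per transfer * channels."""
    return transfers_mt_s * 1e6 * bus_bytes * channels / 1e9

# Assumed per-socket configurations (not stated in this section):
opteron_6174 = peak_bandwidth_gb_s(1333, channels=4)  # DDR3-1333, assumed quad channel
opteron_2435 = peak_bandwidth_gb_s(800, channels=2)   # DDR2-800, assumed dual channel

print(f"Opteron 6174 peak: {opteron_6174:.1f} GB/s per socket")
print(f"Opteron 2435 peak: {opteron_2435:.1f} GB/s per socket")
```

Measured Stream results will land below these ceilings, since links further down the chain, such as the northbridge, cap what the cores can actually pull from memory.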

Also notice that the new Xeon 5600 handles DDR3-1333 a lot more efficiently. We measured 15% higher bandwidth from exactly the same DDR3-1333 DIMMs compared to the older Xeon 5570.  

The other important metric for the memory subsystem is latency. Most of our older latency benchmarks (such as the latency test of CPUID) are no longer valid. So we turned to the latency test of Sisoft Sandra 2010.

CPU                 Speed (GHz)   L1 (Clocks)   L2 (Clocks)   L3 (Clocks)   Memory (ns)
Intel Xeon X5670    2.93          4             10            56            87
Intel Xeon X5570    2.80          4             9             47            81
AMD Opteron 6174    2.20          3             16            57            98
AMD Opteron 2435    2.60          3             16            56            113

 

With Nehalem, Intel increased the latency of the L1 cache from 3 cycles to 4. The tradeoff was meant to allow for future scaling as the basic architecture evolves. The Xeons have the smallest (256 KB) but the fastest L2-cache. The L3-cache of the Xeon 5570 is the fastest, but the latency advantage has disappeared on the Xeon X5670 as the cache size increased from 8 to 12 MB.
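Because the cache latencies are given in clock cycles and the four CPUs run at different speeds, it can help to convert them to nanoseconds before comparing. The sketch below does this for the L3 column, assuming the reported clocks are core clock cycles at the listed speed; that is our assumption, not something the benchmark output confirms.

```python
def cycles_to_ns(cycles, clock_ghz):
    """One cycle at f GHz lasts 1/f ns, so latency in ns = cycles / f."""
    return cycles / clock_ghz

# (clock speed in GHz, L3 latency in clocks), taken from the table above
l3_latency = {
    "Intel Xeon X5670": (2.93, 56),
    "Intel Xeon X5570": (2.80, 47),
    "AMD Opteron 6174": (2.20, 57),
    "AMD Opteron 2435": (2.60, 56),
}

for cpu, (ghz, clocks) in l3_latency.items():
    print(f"{cpu}: {cycles_to_ns(clocks, ghz):.1f} ns")
```

Under that assumption, nearly identical clock counts translate into noticeably different wall-clock latencies once the clock speed is taken into account.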

It is also interesting that the move from DDR2-800 to DDR3-1333 has decreased memory latency by roughly 13% (from 113 ns to 98 ns). There's nothing but good news for the 12-core Opteron here: more bandwidth and lower-latency access per core.


  • Cogman - Tuesday, March 30, 2010

    It should be noted that newer Nehalem-based processors have specific AES encryption instructions. The benchmark where the Xeon blows everything out of the water is likely utilizing that instruction set (though, AFAIK, not many real-world applications do).
  • Hector1 - Tuesday, March 30, 2010

    I read that Intel is expected to launch the 8-core Nehalem EX today. It'll be interesting to compare it against the 12-core Magny-Cours. Both are on a 45nm process.
  • spoman - Tuesday, March 30, 2010

    You stated "... that kind of bandwidth is not attainable, not even in theory because the next link in the chain, the Northbridge ...".

    How does the Northbridge affect memory BW if the memory is connected directly to the processor?
  • JohanAnandtech - Wednesday, March 31, 2010

    Depending on your definition, the northbridge is in the CPU. AMD uses "northbridge" in its own slides to refer to the part where the memory controller etc. resides.
  • Pari_Rajaram - Tuesday, March 30, 2010

    Why don't you add STREAM and LINPACK to your benchmark suites? These are very important benchmarks for HPC.
  • JohanAnandtech - Wednesday, March 31, 2010

    Stream... in the review.
  • piooreq - Wednesday, March 31, 2010

    Hi Johan,
    For the last few days I ran several tests with Swingbench CC using a similar database configuration, but I got somewhat different results. I'm wondering exactly which settings you used for the CC test itself, in particular when generating the schema and data for that test. Thanks for your answer.
  • JohanAnandtech - Thursday, April 01, 2010

    Your question is not completely clear to me. What info would you like? You can e-mail me if you like at johanATthiswebsitePointcom
  • zarjad - Wednesday, March 31, 2010

    Can't figure out whether Hyper-Threading was enabled on the Intel systems. Particularly interested in the virtualization benchmark with Hyper-Threading both enabled and disabled. Also of interest would be an Office benchmark with a bunch of small VMs (1.5 to 2GB) to simulate a VDI configuration.
  • JohanAnandtech - Thursday, April 01, 2010

    Hyper-Threading is always on, but we will follow up on that. A VDI-based hypervisor test is, however, not immediately on the horizon. The people of the VRC project might do that, though. Google the VRC project.
