As Intel and AMD are adding more and more cores to their CPUs, we encounter two main challenges to keep these CPUs scaling. Cache coherency messages can add a lot of latency and absorb a lot of bandwidth, and at the same time all those cores require more and more bandwidth. So the memory subsystem plays an important role. We still use our older stream binary. This binary was compiled by Alf Birger Rustad using v2.4 of Pathscale's C-compiler. It is a multi-threaded, 64-bit Linux Stream binary. The following compiler switches were used:
-Ofast -lm -static -mp
We ran the stream benchmark on SUSE SLES 11. The stream benchmark produces 4 numbers: copy, scale, add, triad. Triad is the most relevant in our opinion, it is a mix of the other three.

The new DDR3 memory controller gives the Opteron 6100 series wings. Compared to the Opteron 2435 which uses DDR-2 800, bandwidth has increased by 130%. Each core gets more bandwidth, which should help a lot of HPC applications. It is a pity of course that the 1.8 GHz Northbridge is limiting the memory subsystem. It would be interesting to see 8-core versions with higher clocked northbridges for the HPC market.
Also notice that the new Xeon 5600 handles DDR3-1333 a lot more efficiently. We measured 15% higher bandwidth from exactly the same DDR3-1333 DIMMs compared to the older Xeon 5570.
The other important metric for the memory subsystem is latency. Most of our older latency benchmarks (such as the latency test of CPUID) are no longer valid. So we turned to the latency test of Sisoft Sandra 2010.
| Speed (GHz) | L1 (Clocks) | L2 (Clocks) | L3 (Clocks) | Memory (ns) | |
| Intel Xeon X5670 | 2.93GHz | 4 | 10 | 56 | 87 |
| Intel Xeon X5570 | 2.80GHz | 4 | 9 | 47 | 81 |
| AMD Opteron 6174 | 2.20GHz | 3 | 16 | 57 | 98 |
| AMD Opteron 2435 | 2.60GHz | 3 | 16 | 56 | 113 |
With Nehalem, Intel increased the latency of the L1 cache from 3 cycles to 4. The tradeoff was meant to allow for future scaling as the basic architecture evolves. The Xeons have the smallest (256 KB) but the fastest L2-cache. The L3-cache of the Xeon 5570 is the fastest, but the latency advantage has disappeared on the Xeon X5670 as the cache size increased from 8 to 12 MB.
Interesting is also the fact that the move from DDR2-800 to DDR3-1333 has also decreased the latency to the memory system by about 15%. There's nothing but good news for the 12-core Opteron here: more bandwith and lower latency access per core.
2. The Intel systems have only 24GB ram against the 32GB ram on the 2S magny cours. That's why the 100GB database test favors the Magny cours by a large margin.