High-End x86: The Nehalem EX Xeon 7500 and Dell R810by Johan De Gelas on April 12, 2010 6:00 PM EST
Understanding the Performance Numbers
A good analysis of the memory subsystems helps to understand the strengths and weaknesses of these server systems. We still use our older stream binary. This binary was compiled by Alf Birger Rustad using v2.4 of Pathscale's C-compiler. It is a multi-threaded, 64-bit Linux Stream binary. The following compiler switches were used:
-Ofast -lm -static –mp
We ran the stream benchmark on SUSE SLES 11. The stream benchmark produces four numbers: copy, scale, add, triad. Triad is the most relevant in our opinion; it is a mix of the other three.
The Xeon X7560 fails to impress. Intel's engineers expected 36GB/s with the best optimizations. Their own gcc compiled binary (–O3 –fopenmp –static) achieves 25 to 29GB/s, in the same range as our Pathscale compiled binary.
It is interesting to note that single threaded bandwidth is mediocre at best: we got only 5GB/s with DDR3-1066. Even the six-core Opteron with DDR2-800 can reach over 8GB/s, while the newest Opteron DDR3 memory controller achieves 9.5GB/s with DDR3-1333, almost twice as much as the Xeon 7500 series. The best single-threaded performance comes out of the Xeon 5600 memory controller: 12GB/s with DDR3-1333. Intel clearly had to sacrifice some bandwidth too to achieve the enormous memory capacity (64 slots and 1TB without "extensions"). Let's look at latency.
|CPU||Speed (GHz)||L1 (clocks)||L2 (clocks)||L3 (clocks)||Memory (ns)|
The L3 cache latency of our Xeon X7560 is very impressive, considering that we are talking about a 24MB L3. Memory latency clearly suffers from the serial-buffer-parallel DRAM transitions. We also did a cache bandwidth test with SiSoft Sandra 2010.
|CPU||Speed (GHz)||L1 CPU (GB/s)||L2 CPU (GB/s)||L3 (GB/s)|
The most interesting number here is the L3 cache since all cores must access it, and it matters for almost all applications. The throughput of the L1 and L2 caches is mostly important for the few embarrassingly parallel applications. And here we see that the extra engineer on the Nehalem EX pays off: it clearly has the fastest L3 cache. The Opteron are the second fastest, but the exclusive nature of the L3 caches may need quite a bit more bandwidth. In a nutshell: the Xeon 7500 comes with probably the best L3 cache on the market, but the memory subsystem is quite a bit slower than on other server CPU systems.