Understanding the Performance Numbers

A good analysis of the memory subsystems helps us understand the strengths and weaknesses of these server systems. We still use our older Stream binary, which was compiled by Alf Birger Rustad using v2.4 of Pathscale's C compiler. It is a multi-threaded, 64-bit Linux Stream binary. The following compiler switches were used:

-Ofast -lm -static -mp

We ran the stream benchmark on SUSE SLES 11. The stream benchmark produces four numbers: copy, scale, add, triad. Triad is the most relevant in our opinion; it is a mix of the other three.
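To make the four numbers concrete, here is a minimal sketch of the four Stream kernels and of how a bandwidth figure follows from each one. It is written in Python only for readability: the function name `run_stream` and the array size are illustrative assumptions, not part of the binary we used, and interpreter overhead means the rates it reports are nowhere near real memory bandwidth.

```python
import time

BYTES_PER_ELEMENT = 8  # Stream works on 64-bit floating-point elements


def run_stream(n=100_000, scalar=3.0):
    """Run each Stream kernel once; return (arrays, {kernel name: GB/s}).

    Hypothetical helper for illustration only.
    """
    a, b, c = [1.0] * n, [2.0] * n, [0.0] * n

    def copy():
        for i in range(n):
            c[i] = a[i]

    def scale():
        for i in range(n):
            b[i] = scalar * c[i]

    def add():
        for i in range(n):
            c[i] = a[i] + b[i]

    def triad():
        for i in range(n):
            a[i] = b[i] + scalar * c[i]

    # Each entry: (kernel, number of arrays read or written per element).
    kernels = [(copy, 2), (scale, 2), (add, 3), (triad, 3)]
    rates = {}
    for kernel, arrays_touched in kernels:
        start = time.perf_counter()
        kernel()
        elapsed = time.perf_counter() - start
        bytes_moved = arrays_touched * BYTES_PER_ELEMENT * n
        rates[kernel.__name__] = bytes_moved / elapsed / 1e9
    return (a, b, c), rates


if __name__ == "__main__":
    _, rates = run_stream()
    for name, gbs in rates.items():
        print(f"{name:<6} {gbs:.3f} GB/s")
```

Triad touches three arrays and combines a multiply with an add, which is why it reads as a mix of the other three kernels.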

[Chart: Stream TRIAD on 64-bit Linux, maximum threads]

The Xeon X7560 fails to impress. Intel's engineers expected 36GB/s with the best optimizations. Their own gcc-compiled binary (-O3 -fopenmp -static) achieves 25 to 29GB/s, in the same range as our Pathscale-compiled binary.

Single-threaded bandwidth is mediocre at best: we measured only 5GB/s with DDR3-1066. Even the six-core Opteron with DDR2-800 reaches over 8GB/s, while the newest Opteron DDR3 memory controller achieves 9.5GB/s with DDR3-1333, almost twice as much as the Xeon 7500 series. The best single-threaded performance comes from the Xeon 5600 memory controller: 12GB/s with DDR3-1333. Intel clearly had to sacrifice some bandwidth to achieve the enormous memory capacity (64 slots and 1TB without "extensions"). Let's look at latency.
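The single-threaded comparison is easier to judge as ratios against the Xeon X7560. The short sketch below uses only the figures quoted above; the dictionary labels are ours, and the six-core Opteron is entered as 8.0 although the text says "over 8GB/s", so its ratio is a lower bound.

```python
# Single-threaded Stream bandwidth in GB/s, as quoted in the text.
SINGLE_THREAD_GBS = {
    "Xeon X7560, DDR3-1066": 5.0,
    "Six-core Opteron, DDR2-800": 8.0,  # "over 8GB/s", so a lower bound
    "Newest Opteron, DDR3-1333": 9.5,
    "Xeon 5600, DDR3-1333": 12.0,
}


def ratios_to(baseline_name, table=SINGLE_THREAD_GBS):
    """Return {system: bandwidth relative to the baseline system}."""
    baseline = table[baseline_name]
    return {name: round(gbs / baseline, 2) for name, gbs in table.items()}


if __name__ == "__main__":
    for name, ratio in ratios_to("Xeon X7560, DDR3-1066").items():
        print(f"{name:<28} {ratio:.2f}x")
```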

CPU            Speed (GHz)   L1 (clocks)   L2 (clocks)   L3 (clocks)   Memory (ns)
Xeon X5670     2.93          4             10            56            87
Xeon X5570     2.80          4             9             47            81
Opteron 6174   2.2           3             16            57            98
Opteron 2435   2.6           3             16            56            113
Xeon X7560     2.26          4             9             63            160
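Because the cache columns are in clocks and the memory column is in nanoseconds, it can help to convert between the two. The sketch below assumes that the cache latencies are counted in core clocks at the listed speed, which the table does not state explicitly (Turbo or a separate uncore clock would change the result); the helper names are illustrative.

```python
# (CPU, core clock in GHz, L3 latency in clocks, memory latency in ns),
# copied from the latency table above.
LATENCY_TABLE = [
    ("Xeon X5670", 2.93, 56, 87),
    ("Xeon X5570", 2.80, 47, 81),
    ("Opteron 6174", 2.2, 57, 98),
    ("Opteron 2435", 2.6, 56, 113),
    ("Xeon X7560", 2.26, 63, 160),
]


def clocks_to_ns(clocks, ghz):
    """At f GHz one clock lasts 1/f ns, so latency in ns = clocks / f."""
    return clocks / ghz


def ns_to_clocks(ns, ghz):
    """Inverse conversion: latency in clocks = ns * f."""
    return ns * ghz


def summarize(table=LATENCY_TABLE):
    """Return {cpu: (L3 latency in ns, memory latency in core clocks)}."""
    return {
        cpu: (clocks_to_ns(l3_clocks, ghz), ns_to_clocks(mem_ns, ghz))
        for cpu, ghz, l3_clocks, mem_ns in table
    }


if __name__ == "__main__":
    for cpu, (l3_ns, mem_clocks) in summarize().items():
        print(f"{cpu:<14} L3 {l3_ns:5.1f} ns   memory {mem_clocks:6.1f} clocks")
```

Under that assumption, the X7560's 160 ns memory latency works out to roughly 360 core clocks, while its L3 stays under 30 ns.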

The L3 cache latency of our Xeon X7560 is very impressive, considering that we are talking about a 24MB L3. Memory latency clearly suffers from the serial-buffer-parallel DRAM transitions. We also did a cache bandwidth test with SiSoft Sandra 2010.

CPU            Speed (GHz)   L1 CPU (GB/s)   L2 CPU (GB/s)   L3 (GB/s)
Xeon X5670     2.93          717             539             150
Xeon X5570     2.80          437             312             114
Opteron 6174   2.2           768             378             194
Opteron 2435   2.6           472             281             228
Xeon X7560     2.26          667             502             275

The most interesting number here is the L3 cache bandwidth, since all cores must access the L3 and it matters for almost all applications. The throughput of the L1 and L2 caches matters mostly for the few embarrassingly parallel applications. And here we see that the extra engineering on the Nehalem EX pays off: it clearly has the fastest L3 cache. The Opterons are second fastest, but the exclusive nature of their L3 caches may need quite a bit more bandwidth. In a nutshell: the Xeon 7500 comes with probably the best L3 cache on the market, but the memory subsystem is quite a bit slower than on other server CPU systems.
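The L3 ranking can be read straight off the Sandra table. The following sketch simply sorts the measured L3 bandwidth and expresses each result as a fraction of the fastest one; the function name is illustrative, and nothing here goes beyond the table's own numbers.

```python
# L3 cache bandwidth in GB/s, copied from the Sandra 2010 table above.
L3_BANDWIDTH_GBS = {
    "Xeon X5670": 150,
    "Xeon X5570": 114,
    "Opteron 6174": 194,
    "Opteron 2435": 228,
    "Xeon X7560": 275,
}


def rank_l3(bandwidth=L3_BANDWIDTH_GBS):
    """Return [(cpu, GB/s, fraction of the fastest)], fastest first."""
    fastest = max(bandwidth.values())
    ranked = sorted(bandwidth.items(), key=lambda item: item[1], reverse=True)
    return [(cpu, gbs, gbs / fastest) for cpu, gbs in ranked]


if __name__ == "__main__":
    for cpu, gbs, fraction in rank_l3():
        print(f"{cpu:<14} {gbs:4d} GB/s   {fraction:.0%} of fastest")
```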

23 Comments

  • JohanAnandtech - Tuesday, April 13, 2010

    "Damn, Dell cut half the memory channels from the R810!"

    You read too fast again :-). That is only the case in the quad-CPU config. In the dual-CPU config, you get 4 memory controllers, each of which connects to two SMBs. So in a dual config, you get the same bandwidth as you would in another server.

    The R810 targets those that are not after the highest CPU processing power, but want the RAS features and 32 DIMM slots. AFAIK,
  • whatever1951 - Tuesday, April 13, 2010

    2 channels of DDR3-1066 per socket in a fully populated R810 and if you populate 2 sockets, you get the flex memory routing penalty...damn..............!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! R810 sucks.
  • Sindarin - Tuesday, April 13, 2010

    whatever1951 you lost me @ Hello.........................and I thought Sauron was tough!! lol
  • JohanAnandtech - Tuesday, April 13, 2010

    "It is hard to imagine 4 channels of DDR3-1066 to be 1/3 slower than even the westmere-eps."

    On one side you have a parallel half duplex DDR-3 DIMM. On the other side of the SMB you have a serial full duplex SMI. The buffers might not perform this transition fast enough, and there has to be some overhead. I also am still searching for the clockspeed of the IMC. The SMIs are on a different (I/O) clockdomain than the L3-cache.

    We will test with Intel's / QSSC quad CPU to see whether the flexmem bridge has any influence. But I don't think it will do much. You might add a bit of latency, but essentially the R810 is working like a dual CPU with four IMCs just like another (Dual CPU) Nehalem EX server system would.
  • whatever1951 - Tuesday, April 13, 2010

    Thanks for the useful info. R810 then doesn't meet my standard.

    Johan, is there any way you can get your hands on an R910 4 Processor system from Dell and bench the memory bandwidth to see how much that flex mem chip costs in terms of bandwidth?
  • IntelUser2000 - Tuesday, April 13, 2010

    The Uncore of the X7560 runs at 2.4GHz.
  • JohanAnandtech - Wednesday, April 14, 2010

    Do you have a source for that? Must have missed it.
  • Etern205 - Thursday, April 15, 2010

    I think AT needs to fix this "RE:RE:RE...:" problem?
  • amalinov - Wednesday, April 14, 2010

    Great article! I like the way in which you describe the memory subsystem - I have read the Intel datasheets and many news articles about Xeon 7500, but your description is the best so far.

    You say "So each CPU has two memory interfaces that connect to two SMBs that can each drive two channels with two DIMMS. Thus, each CPU supports eight registered DDR3 DIMMs ...", but if I do the math it seems: 2 SMIs x 2 SMBs x 2 channels x 2 DIMMs = 16 DDR3 DIMMs, not 8 as written in the second sentence. Later in the article I think you mention 16 at different places, so it seems it is really 16 and not 8.

    What about an Itanium 9300 review (including general background on the plans of OEMs/Intel for the IA-64 platform)? A comparison of scalability (HT/QPI)/memory/RAS features of Xeon 7500, Itanium 9300 and Opteron 6000 would be welcome. Also I would like to see a performance comparison with appropriate applications for the RISC mainframe market (HPC?) with 4- and 8-socket AMD, Intel Xeon, Intel Itanium, POWER7, newest SPARC.
  • jeha - Thursday, April 15, 2010

    You really should review the IBM 3850 X5 I think?

    They have some interesting solutions when it comes to handling memory expansions etc.
