Memory Subsystem Bandwidth

Let's set the stage with some meaningful low-level benchmarks. First, we measured the memory bandwidth in Linux. The binary was compiled with the Open64 compiler 5.0 (opencc). It is a multi-threaded, OpenMP-based, 64-bit binary. The following compiler switches were used:

-Ofast -mp -ipa

The results are expressed in GB per second. Note that we also tested with gcc 4.8.1 and the following compiler options:

-O3 -fopenmp -static

Results were consistently 20% to 30% lower with gcc, so we feel our choice of Open64 is appropriate. Anybody can reproduce our results (Open64 is freely available), and since the binary is capable of reaching higher speeds, it is easier to spot speed differences. First, we compared our DDR4-2133 LRDIMMs with the Registered DDR4-2133 DIMMs on the Xeon E5-2695 v3 (14 cores at 2.3GHz, Turbo up to 3.6GHz).
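For readers unfamiliar with how a Stream Triad figure in GB/s comes about, here is a minimal sketch in Python. It is not the binary we benchmarked (that one is compiled, OpenMP-threaded code); the function names are our own, and pure Python is far too slow to saturate a memory controller, so the printed number only illustrates the bookkeeping. The kernel is a[i] = b[i] + scalar * c[i], and STREAM counts two reads and one write of an 8-byte double per element, with 1 GB taken as 10^9 bytes.

```python
import time
from array import array


def triad(a, b, c, scalar):
    """STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bytes(n, bytes_per_element=8):
    """Bytes counted per Triad pass: two reads and one write per element."""
    return 3 * n * bytes_per_element


def bandwidth_gbs(n, seconds, bytes_per_element=8):
    """Bandwidth in GB/s (1 GB = 1e9 bytes) for one Triad pass over n elements."""
    return triad_bytes(n, bytes_per_element) / seconds / 1e9


if __name__ == "__main__":
    n = 200_000  # illustrative size; real runs use arrays far larger than the L3 cache
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n

    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start

    # Interpreter overhead dominates here; this is not a memory bandwidth measurement.
    print(f"{bandwidth_gbs(n, elapsed):.3f} GB/s")
```

The real benchmark repeats the kernel several times and reports the best pass, which is why a faster binary makes differences between platforms easier to see.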

Stream Triad LR vs Registered

Registered DIMMs are slightly faster at 1DPC, but LRDIMMs are clearly faster when you insert more than one DIMM per channel. We measured a 16% to 18% difference in performance. It's interesting to note that LRDIMMs are supposed to run at DDR4-1600 at 3DPC according to Intel's documentation, but our bandwidth measurement points to DDR4-1866. The command "dmidecode --type 17", which reads out the memory device entries from the BIOS tables, confirmed this.
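To see why a bandwidth measurement can hint at the actual memory speed, it helps to look at the theoretical ceiling for each speed grade. The sketch below is our own back-of-the-envelope arithmetic, not part of the benchmark: it assumes a 64-bit (8-byte) data path per channel, four channels per socket, and nominal transfer rates; the function name is hypothetical.

```python
def peak_bandwidth_gbs(mt_per_s, channels=4, bytes_per_transfer=8):
    """Theoretical peak bandwidth of one socket in GB/s (1 GB = 1e9 bytes).

    mt_per_s is the nominal transfer rate in megatransfers per second,
    e.g. 2133 for DDR4-2133.
    """
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    baseline = peak_bandwidth_gbs(1600)
    for speed in (1600, 1866, 2133):
        peak = peak_bandwidth_gbs(speed)
        print(f"DDR4-{speed}: {peak:.1f} GB/s per socket, "
              f"{peak / baseline:.2f}x DDR4-1600")
```

Going from 1600 to 1866 raises the ceiling by roughly 17%, so a measured result that sits well above what DDR4-1600 could plausibly sustain suggests the DIMMs are running faster than documented.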

Next, we compared the different Xeon platforms.

Stream Triad

The new Xeon E5-2600 v3 has access to 15-21% more bandwidth than the E5-2600 v2, which uses DDR3-1866, and almost 50% more than the first Xeon E5s (DDR3-1600). Interestingly, the previous generation Xeons and the Xeon E5-2667 v3 need one thread per logical core to use the full potential of the memory controller. The Xeon E5-2667 v3 behaves like the previous Xeons because it is also a die with one dual ring and one memory controller. Also, 16 threads (one per physical core) is probably not enough to get the full potential of a quad channel DDR4-2133 memory subsystem. The new High Core Count (HCC, 14-18 core) Xeon E5 chips perform better with one thread per physical core.

Although it makes sense that a CPU needs a certain number of threads to get its memory controller working at full speed, it's still interesting to note that the previous 12-core Xeon E5-2697 v2 can only offer 41GB/s at 24 threads while the 14-core Xeon E5-2695 v3 is already delivering more than twice as much bandwidth at 28 threads. Of course, those kinds of bandwidth numbers only matter for specific HPC benchmarks, as the L3 cache (30-45MB) will take care of most of the requests. Latency, however, always matters.
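To put the 41GB/s figure in perspective, one can compare it against the theoretical ceiling of the platform. The sketch below is our own rough arithmetic under stated assumptions: a dual-socket system (inferred from the thread counts), four DDR3-1866 channels per socket, and 8 bytes per transfer; the function names are hypothetical.

```python
def platform_peak_gbs(mt_per_s, channels=4, sockets=2, bytes_per_transfer=8):
    """Theoretical peak bandwidth of the whole platform in GB/s (1 GB = 1e9 bytes)."""
    return mt_per_s * bytes_per_transfer * channels * sockets / 1000.0


def efficiency(measured_gbs, mt_per_s, channels=4, sockets=2):
    """Fraction of the theoretical peak that a measured bandwidth represents."""
    return measured_gbs / platform_peak_gbs(mt_per_s, channels, sockets)


if __name__ == "__main__":
    # 41 GB/s measured on the Xeon E5-2697 v2 at 24 threads, DDR3-1866 assumed.
    share = efficiency(41.0, 1866)
    print(f"{share:.0%} of theoretical peak")
```

Under these assumptions, the older platform uses only about a third of its theoretical bandwidth at 24 threads, which fits the observation that it needs more threads to keep its memory controller busy.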


  • cmikeh2 - Monday, September 8, 2014 - link

    In the SKU comparison table you have the E5-2690V2 listed as a 12/24 part when it is in fact a 10/20 part. Just a tiny quibble. Overall a fantastic read.
  • KAlmquist - Monday, September 8, 2014 - link

    Also, the 2637 v2 is 4/8, not 6/12.
  • isa - Monday, September 8, 2014 - link

    Looking forward to a new supercomputer record using these behemoths.
  • Bruce Allen - Monday, September 8, 2014 - link

    Awesome article. I'd love to see Cinebench and other applications tests. We do a lot of rendering (currently with older dual Xeons) and would love to compare these new Xeons versus the new 5960X chips - software license costs per computer are so high that the 5960X setups will need much higher price/performance to be worth it. We actually use Cinema 4D in production so those scores are relevant. We use V-Ray, Mental Ray and Arnold for Maya too but in general those track with the Cinebench scores so they are a decent guide. Thank you!
  • Ian Cutress - Monday, September 8, 2014 - link

    I've got some E5 v3 Xeons in for a more workstation oriented review. Look out for that soon :)
  • fastgeek - Monday, September 8, 2014 - link

    From my notes a while back... two E5-2690 v3's (all cores + turbo enabled) under 2012 Server yielded 3,129 for multithreaded and 79 for single.

    While not Haswell, I can tell you that four E5-4657L V2's returned 4,722 / 94 respectively.

    Hope that helps somewhat. :-)
  • fastgeek - Monday, September 8, 2014 - link

    I don't see a way to edit my previous comment; but those scores were from Cinebench R15
  • wireframed - Saturday, September 20, 2014 - link

    You pay for licenses for render Nodes? Switch to 3DS, and you get 9999 nodes for free (unless they changed the licensing since I last checked). :)
  • Lone Ranger - Monday, September 8, 2014 - link

    You make mention that the large core count chips are pretty good about raising their clock rate when only a few cores are active. Under Linux, what is the best way to see actual turbo frequencies? cpuinfo doesn't show live/actual clock rate.
  • JohanAnandtech - Monday, September 8, 2014 - link

    The best way to do this is using Intel's PCM. However, this does not work right now (only on Sandy and Ivy, not Haswell). I deduced it from the fact that performance was almost identical and from previous profiling of some of our benchmarks.
