Memory Subsystem: Bandwidth

Let's set the stage with some meaningful low-level benchmarks. First, we measured memory bandwidth in Linux. The binary was compiled with the Open64 5.0 compiler (opencc); it is a multi-threaded, OpenMP-based, 64-bit binary. The following compiler switches were used:

-Ofast -mp -ipa

The results are expressed in GB per second. Note that we also tested with gcc 4.8.1 and compiler options

-O3 -fopenmp -static

Results were consistently 20% to 30% lower with gcc, so we feel our choice of Open64 is appropriate: anyone can reproduce our results (Open64 is freely available), and since the binary reaches higher speeds, differences between memory configurations are easier to spot. First, we compared our DDR4-2133 LRDIMMs with Registered DDR4-2133 DIMMs on the Xeon E5-2695 v3 (14 cores at 2.3GHz, Turbo up to 3.6GHz).
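For those who want to roll their own test, the heart of the STREAM benchmark we use is the Triad loop below. This is a minimal sketch rather than McCalpin's official benchmark: the array size, the timing code, and the file name stream.c are our own choices, and the arrays must dwarf the combined L3 caches so that DRAM, not cache, is measured.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    /* Arrays must be far larger than the combined L3 caches
       (30-45MB per socket here) so we measure DRAM, not cache. */
    const long n = 120L * 1000 * 1000;          /* 3 x 0.96GB */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    const double scalar = 3.0;

    if (!a || !b || !c)
        return 1;

    /* Parallel first touch: pages get spread over both sockets'
       memory controllers instead of piling up on one NUMA node. */
    #pragma omp parallel for
    for (long j = 0; j < n; j++) {
        a[j] = 1.0; b[j] = 2.0; c[j] = 0.5;
    }

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];            /* the Triad kernel */
    t = omp_get_wtime() - t;

    /* Triad touches 24 bytes per iteration: two reads, one write. */
    printf("Triad: %.1f GB/s\n", 3.0 * n * sizeof(double) / t / 1e9);

    free(a); free(b); free(c);
    return 0;
}

Compile it with the switches listed above (opencc -Ofast -mp stream.c or gcc -O3 -fopenmp stream.c) and the OpenMP runtime will spread the loop iterations over all available cores.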

Stream Triad LR vs Registered

Registered DIMMs are slightly faster at 1DPC, but LRDIMMs are clearly faster when you insert more than one DIMM per channel: we measured a 16% to 18% difference in performance. It's interesting to note that LRDIMMs are supposed to run at 1600 at 3DPC according to Intel's documentation, but our bandwidth measurement points to 1866. The command "dmidecode --type 17", which reads the DIMM information out of the BIOS, confirmed this.
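The check is easy to reproduce. As root, run:

dmidecode --type 17

Each "Memory Device" entry in the output describes one DIMM slot and lists the speed the module is configured to run at (the exact field names vary a bit between dmidecode versions).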

Next, we compared the different Xeon platforms.

Stream Triad

The new Xeon E5-2600 v3 has access to 15-21% more bandwidth than the E5-2600 v2, which uses DDR3-1866, and almost 50% more than the first Xeon E5s (DDR3-1600). Interestingly, the previous generation Xeons and the Xeon E5-2667 v3 need one thread per logical core to use the full potential of the memory controller. The reason the Xeon E5-2667 v3 shows the same behavior as the previous Xeons is that it too is a die with one dual ring and one memory controller. Also, 16 threads (one per physical core) is probably not enough to extract the full potential of a quad-channel DDR4-2133 memory subsystem, which in theory peaks at 4 channels x 2133 MT/s x 8 bytes, or roughly 68GB/s per socket. The new High Core Count (HCC, 14-18 core) Xeon E5 chips perform better with one thread per physical core.

Although it makes sense that a CPU needs a certain number of threads to get its memory controller working at full speed, it is still interesting to note that the previous 12-core Xeon E5-2697 v2 offers only 41GB/s at 24 threads, while the 14-core Xeon E5-2695 v3 already delivers more than twice as much bandwidth at 28 threads. Of course, those kinds of bandwidth numbers only matter for specific HPC workloads, as the large L3 cache (30-45MB) will take care of most requests. Latency, however, always matters.
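Reproducing the thread scaling is straightforward with any OpenMP binary: the standard OMP_NUM_THREADS environment variable controls the thread count. On our dual 14-core E5-2695 v3 setup (stream is a hypothetical binary name):

OMP_NUM_THREADS=28 ./stream
OMP_NUM_THREADS=56 ./stream

The first run uses one thread per physical core, the second one thread per logical core (with Hyper-Threading enabled).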

Comments

  • SuperVeloce - Tuesday, September 9, 2014 - link

    Oh, nevermind... I unknowingly caught an error.
  • JohanAnandtech - Tuesday, September 9, 2014 - link

    thx! Fixed. Sorry for the late reaction, jetlagged and trying to get used to the hectic pace of IDF :-)
  • hescominsoon - Tuesday, September 9, 2014 - link

    As long as AMD continues its idiotic design of two integer units sharing an FPU, they will be an afterthought in the CPU department.
  • nils_ - Sunday, September 14, 2014 - link

    Serious competition for Intel will not come from AMD any time soon, but possibly from IBM with the POWER8. Tyan even came out with a single-socket board for that CPU, so it might make its way into the same market soon.
  • ScarletEagle - Tuesday, September 16, 2014 - link

    Any feel for the relative HPC performance of the E5-2680v3 with respect to the E5-2650Lv3? I am looking at purchasing a PowerEdge 730 with two of these and the 2133MHz RAM. My guess is that the higher base clock speed should make somewhat of an improvement?
