Memory Subsystem Bandwidth

Let's set the stage first and perform some meaningful low level benchmarks. First, we measured the memory bandwidth in Linux. The binary was compiled with the Open64 compiler 5.0 (Opencc). It is a multi-threaded, OpenMP based, 64-bit binary. The following compiler switches were used:

-Ofast -mp -ipa

The results are expressed in GB per second. Note that we also tested with gcc 4.8.1 and compiler options

-O3 –fopenmp –static

Results were consistently 20% to 30% lower with gcc, so we feel our choice of Open64 is appropriate. Everybody can reproduce our results (Open64 is freely available) and since the binary is capable of reaching higher speeds, it is easier to spot speed differences. First we compared our DDR4-2133 LRDIMMs with the Registered DDR4-2133 DIMMs on the Xeon E5-2695 v3 (14 cores at 2.3GHz, Turbo up to 3.6GHz).

Stream Triad LR vs Registered

Registered DIMMs are slightly faster at 1DPC, but LRDIMMs are clearly faster when you insert more than one DIMM per channel. We measured a 16% to 18% difference in performance. It's interesting to note that LRDIMMs are supposed to run at 1600 at 3DPC according to Intel's documentation, but our bandwidth measurement points to 1866. The command "dmidecode -type 17" that reads out the BIOS confirmed this.

Next, we compared the different Xeon platforms.

Stream Triad

The new Xeon E5-2600 v3 has access to 15-21% more bandwidth than the E5-2600 v2, which uses DDR3-1866, and almost 50% more than the first Xeon E5s (DDR3-1600). Interestingly, the previous generation Xeons and the Xeon E5-2667 v3 need to use one thread per logical thread to use the full potential of the memory controller. The reason that the Xeon E5-2667 v3 shows similar behavior as the previous Xeons is that it is also a die with one dual ring and one memory controller. Also, 16 threads (one per physical core) is probably not enough to get the full potential of a quad channel DDR4-2133 memory subsystem. The new High Core Count (HCC, 14-18 core) Xeon E5 chips perform better with one thread per physical processor.

Although it makes sense that a CPU needs a certain number of threads to get its memory controller working at full speed, it's still interesting to note that the previous 12-core Xeon E5-2697 v2 can only offer 41GB/s at 24 threads while the 14-core Xeon E5-2695 v3 is already delivering more than twice as much bandwidth at 28 threads. Of course, those kind of bandwidth numbers only matter for specific HPC benchmarks as the L3 cache (30-45MB L3) will take care of most of the requests. Latency however always matters.

Benchmark Configuration and Methodology Memory Subsystem: Latency
Comments Locked

85 Comments

View All Comments

  • LostAlone - Saturday, September 20, 2014 - link

    Given the difference in size between the two companies it's not really all that surprising though. Intel are ten times AMD's size, and I have to imagine that Intel's chip R&D department budget alone is bigger than the whole of AMD. And that is sad really, because I'm sure most of us were learning our computer science when AMD were setting the world on fire, so it's tough to see our young loves go off the rails. But Intel have the money to spend, and can pursue so many more potential avenues for improvement than AMD and that's what makes the difference.
  • Kevin G - Monday, September 8, 2014 - link

    I'm actually surprised they released the 18 core chip for the EP line. In the Ivy Bridge generation, it was the 15 core EX die that was harvested for the 12 core models. I was expecting the same thing here with the 14 core models, though more to do with power binning than raw yields.

    I guess with the recent TSX errata, Intel is just dumping all of the existing EX dies into the EP socket. That is a good means of clearing inventory of a notably buggy chip. When Haswell-EX formally launches, it'll be of a stepping with the TSX bug resolved.
  • SanX - Monday, September 8, 2014 - link

    You have teased us with the claim that added FMA instructions have double floating point performance. Wow! Is this still possible to do that with FP which are already close to the limit approaching just one clock cycle? This was good review of integer related performance but please combine with Ian to continue with the FP one.
  • JohanAnandtech - Monday, September 8, 2014 - link

    Ian is working on his workstation oriented review of the latest Xeon
  • Kevin G - Monday, September 8, 2014 - link

    FMA is common place in many RISC architectures. The reason why we're just seeing it now on x86 is that until recently, the ISA only permitted two registers per operand.

    Improvements in this area maybe coming down the line even for legacy code. Intel's micro-op fusion has the potential to take an ordinary multiply and add and fuse them into one FMA operation internally. This type of optimization is something I'd like to see in a future architecture (Sky Lake?).
  • valarauca - Monday, September 8, 2014 - link

    The Intel compiler suite I believe already converts

    x *= y;
    x += z;

    into an FMA operation when confronted with them.
  • Kevin G - Monday, September 8, 2014 - link

    That's with source that is going to be compiled. (And don't get me wrong, that's what a compiler should do!)

    Micro-op fusion works on existing binaries years old so there is no recompile necessary. However, micro-op fusion may not work in all situations depending on the actual instruction stream. (Hypothetically the fusion of a multiply and an add in an instruction stream may have to be adjacent to work but an ancient compiler could have slipped in some other instructions in between them to hide execution latencies as an optimization so it'd never work in that binary.)
  • DIYEyal - Monday, September 8, 2014 - link

    Very interesting read.
    And I think I found a typo: page 5 (power optimization). It is well known that THE (not needed) Haswell HAS (is/ has been) optimized for low idle power.
  • vLsL2VnDmWjoTByaVLxb - Monday, September 8, 2014 - link

    Colors or labeling for your HPC Power Consumption graph don't seem right.
  • JohanAnandtech - Monday, September 8, 2014 - link

    Fixed, thanks for pointing it out.

Log in

Don't have an account? Sign up now