Memory Subsystem: Bandwidth

For this review we completely overhauled our testing of John McCalpin's Stream bandwidth benchmark. We compiled the Stream 5.10 source code with either the Intel compiler (icc) for Linux version 16 or gcc 4.8.4, both 64-bit. The following compiler switches were used with icc:

-fast -openmp -parallel

The results are expressed in GB per second. The following compiler switches were used on gcc:

-O3 -fopenmp -static

Stream allows us to estimate the maximum performance increase that DDR4-2400 (Xeon E5 v4) can offer over DDR4-2133 (Xeon E5 v3).
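For readers unfamiliar with the benchmark, the Triad kernel that Stream times is the operation a[i] = b[i] + q * c[i], and the reported bandwidth counts two 8-byte reads and one 8-byte write per element. The Python below is only a minimal sketch of that kernel and of the bandwidth arithmetic; it is not the Stream source, the array size is a hypothetical value, and timings taken in an interpreter say nothing about the hardware.

```python
import time
from array import array


def triad(a, b, c, q):
    """Stream-style triad: a[i] = b[i] + q * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth_gbs(n, seconds):
    """GB/s as counted for triad: 2 reads + 1 write of 8 bytes per element."""
    bytes_moved = 3 * 8 * n
    return bytes_moved / seconds / 1e9


if __name__ == "__main__":
    n = 1_000_000  # hypothetical array length, far smaller than a real run
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n

    start = time.perf_counter()
    triad(a, b, c, 3.0)
    elapsed = time.perf_counter() - start

    # Interpreter overhead dominates here; the figure only shows the formula.
    print(f"{triad_bandwidth_gbs(n, elapsed):.3f} GB/s (not a hardware measurement)")
```

A real run uses compiled code, arrays much larger than the combined L3 caches, and one thread per core, which is why the OpenMP switches listed above matter.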

Stream Triad

The Xeon E5 v4 with DDR4-2400 delivers about 15% higher performance than the v3 when we compile Stream with icc. To put this into perspective: DDR-4 @ 1600 delivered 80 GB/s.
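As a rough yardstick for that result, the nominal peak bandwidth of each memory configuration can be worked out from the transfer rate. The sketch below assumes four memory channels per socket and 8 bytes per transfer, which matches the quad-channel Xeon E5 platform; the function name and defaults are illustrative, and the numbers are theoretical ceilings rather than measurements.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels=4, bytes_per_transfer=8):
    """Nominal peak bandwidth of one socket in GB/s.

    Assumes 64-bit (8-byte) channels and four channels per socket.
    """
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


ddr4_2133 = peak_bandwidth_gbs(2133)
ddr4_2400 = peak_bandwidth_gbs(2400)
uplift_percent = (2400 / 2133 - 1) * 100

print(f"DDR4-2133: {ddr4_2133:.1f} GB/s per socket")
print(f"DDR4-2400: {ddr4_2400:.1f} GB/s per socket")
print(f"Nominal uplift: {uplift_percent:.1f}%")
```

Under those assumptions the per-socket ceilings come to roughly 68.3 GB/s and 76.8 GB/s, a nominal uplift of about 12.5% from the faster DIMMs alone.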

The difference between DDR4-2400 and DDR4-2133 is negligible with gcc.

Memory Subsystem: Latency

To measure latency, we use the open source TinyMemBench benchmark. The source was compiled for x86 with gcc 4.8.2 and optimization was set to "-O2". The measurement is described well by the manual of TinyMemBench:

Average time is measured for random memory accesses in the buffers of different sizes. The larger the buffer, the more significant the relative contributions of TLB, L1/L2 cache misses, and DRAM accesses become. All the numbers represent extra time, which needs to be added to L1 cache latency (4 cycles).

We tested with dual random read, as we wanted to see how the memory system coped with multiple read requests. 
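To illustrate the access pattern behind a random-read latency test, the sketch below builds a random single-cycle permutation (Sattolo's algorithm) and then follows it, once as a single dependent chain and once as two independent chains. This is an assumption-laden illustration, not TinyMemBench's implementation: all function names are hypothetical, and in Python the interpreter overhead dwarfs any cache or DRAM latency, so only the structure is meaningful.

```python
import random


def single_cycle_permutation(n, rng):
    """Sattolo's algorithm: a random permutation forming one cycle of length n."""
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, start, steps):
    """One dependent chain: each load address comes from the previous load."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


def dual_chase(nxt, start_a, start_b, steps):
    """Two independent chains, so two outstanding reads can overlap."""
    a, b = start_a, start_b
    for _ in range(steps):
        a = nxt[a]
        b = nxt[b]
    return a, b


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed, only for repeatability
    table = single_cycle_permutation(1 << 16, rng)
    print(chase(table, 0, len(table)))  # a full lap returns to the start
```

The single cycle ensures every element of the buffer is touched before any is revisited, so growing the buffer past each cache level changes where the accesses are served from. Running two chains at once is one way to expose how well the memory system overlaps independent requests.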

The larger the L3 caches get, the higher the latency. Latency has almost doubled from the Xeon E5 v1 to the Xeon E5 v4 while capacity has almost tripled (55 MB vs 20 MB). Still, this will result in a small performance hit in many non-virtualized applications that do not need such a large L3.



View All Comments

  • iwod - Thursday, March 31, 2016

    Maximum memory still 768GB?
    What happened to the 5.1 GHz Xeon E5?
  • Ian Cutress - Thursday, March 31, 2016

    I never saw anyone with a confirmed source for that, making me think it's a faked rumor. I'll happily be proved wrong, but nothing like a 5.1 GHz part was announced today.
  • Brutalizer - Saturday, April 02, 2016

    It would have been interesting to bench against the best cpu today, the SPARC M7. For instance:

    -SAP: two M7 cpus score 169.000 saps vs 109.000 saps for two of these Broadwell-EP cpus

    -Hadoop, sort 10TB data: one SPARC M7 server with four cpus, finishes the sort in 4,260 seconds. Whereas a cluster of 32 PCs equipped with dual E5-2680v2 finishes in 1,054 seconds, i.e. 64 Intel Xeon cpus vs four SPARC M7 cpus.

    -TPC-C: one SPARC M7 server with one cpu gets 5,000,000 tpm, whereas one server with two E5-2699v3 cpus gets 3.600.000 tpm

    -Memory bandwidth, Stream triad: one SPARC M7 reaches 145 GB/sec, whereas two of these Broadwell-EP cpus reach 119 GB/sec

    -etc. All these benchmarks can be found here, and another 25ish benchmarks where SPARC M7 is 2-3x faster than E5-2699v3 or POWER8 (all the way up to 11x faster):
  • Brutalizer - Saturday, April 02, 2016

    BTW, all these SPARC M7 benchmarks are almost unaffected if encryption is turned on, maybe 2-5% slower. Whereas if you turn on encryption for x86 and POWER8, expect performance to halve or even less. Just check the benchmarks on the link above, and you will see that SPARC M7 benchmarks are almost unaffected encrypted or not.
  • JohanAnandtech - Saturday, April 02, 2016

    "if you turn on encryption for x86 and POWER8, expect performance to halve or even less". And this is based upon what measurement? From my measurements, both x86 and POWER8 lose like 1-3% when AES encryption is on. RSA might be a bit worse (2-10%), but asymmetric encryption is mostly used to open connections.
  • Brutalizer - Wednesday, April 06, 2016

    If we talk about how encryption affects performance, let's look at this benchmark below. Never mind the x86 is slower than the SPARC M7, let us instead look at how encryption affects the cpus. What performance hit has encryption?

    -For x86 we see that two E5-2699v3 cpus utilization goes from 40% without crypto, up to 80% with crypto. This leaves the x86 server with very little headroom to do anything else than executing one query. At the same time, the x86 server took 25-30% longer time to process the query. This shows that encryption has a huge impact on x86. You can not do useful work with two x86 cpus, except executing a query. If you need to do additional work, get four x86 xeons instead.

    -If we look at how SPARC M7 gets affected by encryption, we see that cpu utilization went up from 30% up to 40%. So you have lot of headroom to do additional work while processing the query. At the same time, the SPARC cpu took 2% longer time to process the query.

    It is not really interesting that this single SPARC M7 cpu is 30% faster than two E5-2699v3 in absolute numbers. No, we are looking at how much worse the performance gets affected when we turn on encryption. In case of x86, we see that the cpus gets twice the load, so they are almost fully loaded, only by turning on encryption. At the same time taking longer time to process the work. Ergo, you can not do any additional work with x86 with crypto. With SPARC, it ends up with 40% cpu utilization so you can do additional work on SPARC, and process time does not increase at all (2%). This proves that x86 encryption halves performance or worse.

    For your own AES encryption benchmark, you should also see how much cpu utilization goes up. If it gets fully loaded, you can not do any useful work except handling encryption. So you need an additional cpu to do the actual work.
  • JohanAnandtech - Saturday, April 02, 2016

    Two M7 machines start at 90k, while a dual Xeon is around 20k. And most of those Oracle benchmarks are very intellectually dishonest: complicated configurations to get the best out of the M7 machines, midrange older x86 configurations (10-core E5 v2, really???)
  • Brutalizer - Wednesday, April 06, 2016

    The "dishonest" benchmarks from Oracle are often (always?) using what is published. If for instance, IBM only has one published benchmark, then Oracle has no other choice than to use it, right? Of course when there are faster IBM benchmarks out there, Oracle uses those. Same with x86. In all these 25ish cases we see that SPARC M7 is 2-3x faster, all the way up to 11x faster. The benchmarks vary very much: raw compute power, databases, deep learning, SAP, etc etc
  • Phil_Oracle - Thursday, May 12, 2016

    I disagree Johan! You don't appear to know much about the new SPARC M7 systems and I suggest you do a full evaluation before making such remarks. A SPARC T7-1 with 32 cores has a list price of about $39K and outperforms a 2-socket 36-core E5-2699v3 by anywhere from 38% (OLTP HammerDB) to over 8x (OLTP w/ in-memory analytics). A similarly configured *enterprise* class 2-socket 36-core E5-2699v3 from HPE or Cisco lists for $25K+, so in terms of price/performance, the SPARC T7-1 beats the 2-socket E5-2699v3. And if you take into account SW that’s licensed per core, the SPARC M7 is 60% to 2.6x faster/core, dramatically lowering licensing costs. The new E5-2699v4, providing ~20% more cores at roughly the same price, gets closer, but with performance/core not changing much with E5 v4, SPARC M7 still has a huge lead. And the difference is while the E5 v3/v4 chips don't scale beyond 2-socket, you can get a SPARC M7 system up to 16 sockets with almost identical price/performance to the 1-socket system.
  • adamod - Friday, June 03, 2016

