Memory Subsystem: Latency

The performance of modern CPUs depends heavily on the cache subsystem. And some applications depend heavily on the DRAM subsystem too. We used LMBench in an effort to try to measure cache and memory latency. The numbers we looked at were "Random load latency stride=16 Bytes".

Mem
Hierarchy
AMD EPYC 7601
DDR4-2400
Intel Skylake-SP
DDR4-2666
Intel Broadwell
Xeon E5-2699v4
DDR4-2400
L1 Cache cycles 4
L2 Cache cycles  12 14-22  12-15
L3 Cache 4-8 MB - cycles 34-47 54-56 38-51
16-32 MB - ns 89-95 ns 25-27 ns
(+/- 55 cycles?)
27-42 ns
(+/- 47 cycles)
Memory 384-512 MB - ns 96-98 ns 89-91 ns 95 ns

Previously, Ian has described the AMD Infinity Fabric that stitches the two CCXes together in one die and interconnects the 4 different "Zeppelin" dies in one MCM. The choice of using two CCXes in a single die is certainly not optimal for Naples. The local "inside the CCX" 8 MB L3-cache is accessed with very little latency. But once the core needs to access another L3-cache chunk – even on the same die – unloaded latency is pretty bad: it's only slightly better than the DRAM access latency. Accessing DRAM is on all modern CPUs a naturally high latency operation: signals have to travel from the memory controller over the memory bus, and the internal memory matrix of DDR4-2666 DRAM is only running at 333 MHz (hence the very high CAS latencies of DDR4). So it is surprising that accessing SRAM over an on-chip fabric requires so many cycles. 

What does this mean to the end user? The 64 MB L3 on the spec sheet does not really exist. In fact even the 16 MB L3 on a single Zeppelin die consists of two 8 MB L3-caches. There is no cache that truly functions as single, unified L3-cache on the MCM; instead there are eight separate 8 MB L3-caches. 

That will work out fine for applications that have a footprint that fits within a single 8 MB L3 slice, like virtual machines (JVM, Hypervisors based ones) and HPC/Big Data applications that work on separate chunks of data in parallel (for example, the "map" phase of "map/reduce"). However this kind of setup will definitely hurt the performance of applications that need "central" access to one big data pool, such as database applications and big data applications in the "Shuffle phase". 

Memory Subsystem: TinyMemBench

To double check our latency measurements and get a deeper understanding of the respective architectures, we also use the open source TinyMemBench benchmark. The source was compiled for x86 with GCC 5.4 and the optimization level was set to "-O3". The measurement is described well by the manual of TinyMemBench:

Average time is measured for random memory accesses in the buffers of different sizes. The larger the buffer, the more significant the relative contributions of TLB, L1/L2 cache misses, and DRAM accesses become. All the numbers represent extra time, which needs to be added to L1 cache latency (4 cycles).

We tested with dual random read, as we wanted to see how the memory system coped with multiple read requests. 

L3-cache sizes have increased steadily over the years. The Xeon E5 v1 had up to 20 MB, v3 came with 45 MB, and v4 "Broadwell EP" further increased this to 55 MB. But the fatter the cache, the higher the latency became. L3 latency doubled from Sandy Bridge-EP to Broadwell-EP.  So it is no wonder that Skylake went for a larger L2-cache and a smaller but faster L3. The L2-cache offers 4 times lower latency at 512 KB. 

AMD's unloaded latency is very competitive under 8 MB, and is a vast improvement over previous AMD server CPUs. Unfortunately, accessing more 8 MB incurs worse latency than a Broadwell core accessing DRAM. Due to the slow L3-cache access, AMD's DRAM access is also the slowest. The importance of unloaded DRAM latency should of course not be exaggerated: in most applications most of the loads are done in the caches. Still, it is bad news for applications with pointer chasing or other latency-sensitive operations. 

Memory Subsystem: Bandwidth Single Threaded Integer Performance: SPEC CPU2006
POST A COMMENT

219 Comments

View All Comments

  • coder543 - Tuesday, July 11, 2017 - link

    The one benchmark that favors Intel is the "most real-world"? Absolutely, I want AnandTech to do further testing, but your comments do not sound unbiased. Reply
  • ddriver - Wednesday, July 12, 2017 - link

    LOL, buthurt intel fanboy claims that the only unbiased benchmark in the review is THE MOST biased benchmark in the review, the one that was done entirely for the puprpose to help intel save face.

    Because if many core servers running 128 gigs of ram are primarily used to run 16 megabyte databases in the real world. That's right!
    Reply
  • Beany2013 - Tuesday, July 11, 2017 - link

    Sure, test against Ubuntu 17.04 if you only plan to have your server running till January. When it goes end of life. That's not a joke - non LTS Ubuntu released get nine months patches and that's it.

    https://wiki.ubuntu.com/Releases

    16.04 is supported till 2021, it's what will be used in production by people who actually *buy* and *use* servers and as such it's a perfectly representative benchmark for people like me who are looking at dropping six figures on this level of hardware soon and want to see how it performs on...goodness, realistic workloads.
    Reply
  • rahvin - Wednesday, July 12, 2017 - link

    This is a silly argument. No one running these is going to be running bleeding edge software, compiling special kernels or putting optimizing compiler flags on anything. Enterprise runs on stable verified software and OS's. Your typical Enterprise Linux install is similar to RHEL 6 or 7 or it's variants (some are still running RHEL 5 with a 2.6 kernel!). Both RHEL6 and 7 have kernels that are 5+ years old and if you go with 6 it's closer to 10 year old.

    Enterprises don't run bleeding edge software or compile with aggressive flags, these things create regressions and difficult to trace bugs that cost time and lots of money. Your average enterprise is going to care about one thing, that's performance/watt running something like a LAMP stack or database on a standard vanilla distribution like RHEL. Any large enterprise is going to take a review like this and use it as data point when they buy a server and put a standard image on it and test their own workloads perf/watt.

    Some of the enterprises who are more fault tolerant might run something as bleeding edge as an Ubuntu Server LTS release. This review is a fair review for the expected audience, yes every writer has a little bias but I'd dare you to find it in this article, because the fanboi's on both sides are complaining that indicates how fair the review is.
    Reply
  • jjj - Tuesday, July 11, 2017 - link

    Do remember that the future is chiplets, even for Intel.
    The 2 are approaching that a bit differently as AMD had more cost constrains so they went with a 4 cores CCX that can be reused in many different prods.

    Highly doubt that AMD ever goes back to a very large die and it's not like Intel could do a monolithic 48 cores on 10nm this year or even next year and that would be even harder in a competitive market. Sure if they had a Cortex A75 like core and a lot less cache, that's another matter but they are so far behind in perf/mm2 that it's hard to even imagine that they can ever be that efficient.
    Reply
  • coder543 - Tuesday, July 11, 2017 - link

    Never heard the term "chiplet" before. I think AMD has adequately demonstrated the advantages (much higher yield -> lower cost, more than adequate performance), but I haven't heard Intel ever announce that they're planning to do this approach. After the embarrassment that they're experiencing now, maybe they will. Reply
  • Ian Cutress - Tuesday, July 11, 2017 - link

    Look up Intel's EMIB. It's an obvious future for that route to take as process nodes get smaller. Reply
  • Threska - Saturday, July 22, 2017 - link

    We may see their interposer (like used with their GPUs) technology being used. Reply
  • jeffsci - Tuesday, July 11, 2017 - link

    Benchmarking NAMD with pre-compiled binaries is pretty silly. If you can't figure out how to compile it for each every processor of interest, you shouldn't be benchmarking it. Reply
  • CajunArson - Tuesday, July 11, 2017 - link

    On top of all that, they couldn't even be bothered to download and install a (completely free) vanilla version that was released this year. Their version of NAMD 2.10 is from *2014*!

    http://www.ks.uiuc.edu/Development/Download/downlo...
    Reply

Log in

Don't have an account? Sign up now