Memory Subsystem: TinyMemBench

We double-checked our LMBench numbers with Andrei's custom memory latency test.

The latency tool also measures bandwidth, and it became clear that once we move beyond 16 MB, we are accessing DRAM. When Andrei compared these results with our Ryzen 9 3900X numbers, he noted: 

The prefetchers on the Rome platform don't look nearly as aggressive as on the Ryzen unit on the L2 and L3

It would appear that the prefetchers have been tuned differently for Rome compared to Ryzen 3000. In effect, they are less aggressive than on the consumer parts, and we believe AMD made this choice because quite a few applications (Java and HPC workloads, for example) suffer a bit if the prefetchers take up too much bandwidth. Making the prefetchers less aggressive in Rome could therefore aid performance in those workloads. 

While we could not retest all of our servers with Andrei's memory latency test before the deadline (see the "Murphy's Law" section on page 5), we turned to the results of the open source benchmark TinyMemBench. The source was compiled for x86 with GCC at optimization level "-O3". The measurement is described well by TinyMemBench's documentation:

Average time is measured for random memory accesses in the buffers of different sizes. The larger the buffer, the more significant the relative contributions of TLB, L1/L2 cache misses, and DRAM accesses become. All the numbers represent extra time, which needs to be added to L1 cache latency (4 cycles).

We tested with dual random read, as we wanted to see how the memory system coped with multiple read requests. 
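
To make the "dual random read" methodology more concrete, here is a minimal sketch of the idea (not TinyMemBench's actual code): two independently shuffled pointer chains are walked at the same time, so the memory system always has two outstanding random reads in flight. The buffer size, iteration count, and build line are our own assumptions.

/*
 * Minimal sketch of a "dual random read" latency measurement.
 * This is NOT TinyMemBench's actual code: two independent, randomly
 * linked pointer chains are walked at the same time, so the memory
 * system always has two outstanding random reads in flight.
 * Assumed build line: gcc -O3 dual_random_read.c -o dual_random_read
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Sattolo's algorithm: build a single random cycle so that following
 * chain[i] repeatedly visits every element in random order. */
static void build_chain(size_t *chain, size_t n)
{
    for (size_t i = 0; i < n; i++)
        chain[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j < i, so no self-loops */
        size_t tmp = chain[i];
        chain[i] = chain[j];
        chain[j] = tmp;
    }
}

int main(void)
{
    /* 64 MB working set (assumption), well beyond a Rome CCX's L3 slice. */
    const size_t n = (64UL * 1024 * 1024) / sizeof(size_t);
    const size_t iters = 20UL * 1000 * 1000;

    size_t *a = malloc(n * sizeof(size_t));
    size_t *b = malloc(n * sizeof(size_t));
    if (!a || !b)
        return 1;
    build_chain(a, n);
    build_chain(b, n);

    size_t pa = 0, pb = 0;
    double t0 = now_sec();
    for (size_t i = 0; i < iters; i++) {
        pa = a[pa];   /* two independent dependency chains: the CPU can    */
        pb = b[pb];   /* overlap these loads, just like "dual random read" */
    }
    double t1 = now_sec();

    printf("average time per iteration (two overlapped reads): %.1f ns\n",
           (t1 - t0) / (double)iters * 1e9);

    /* Use the results so the compiler cannot optimize the loop away. */
    return (int)((pa ^ pb) & 1);
}

Running a single chain would give the plain random read latency; the second, independent chain reveals how much memory-level parallelism the core and the memory subsystem can extract, which is exactly what we wanted to see.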

The graph shows how the larger L3 cache of the EPYC 7742 results in much lower latency between 4 and 16 MB compared to the EPYC 7601. The L3 cache inside the CCX (the 2-8 MB range) is also very fast compared to Intel's mesh (Xeon 8280) and ring (Xeon E5) topologies. 

However, once we access more than 16 MB, Intel has a clear advantage thanks to its slower but much larger shared L3 cache. When we tested the new EPYC CPUs in a more advanced NUMA setting (NPS = 4, meaning four NUMA nodes per socket), the latency at 64 MB dropped from 129 ns to 119 ns. To quote AMD's engineers:

In NPS4, the NUMA domains are reported to software in such a way that the chiplets always access the near (2-channel) DRAM. In NPS1 the 8 channels are hardware-interleaved, and there is more latency to get to the further ones. It varies by pairs of DRAM channels, with the furthest pair being ~20-25 ns (depending on the various speeds) further away than the nearest. Generally, the latencies are +~6-8 ns, +~8-10 ns, and +~20-25 ns per pair of channels versus the physically nearest ones.

So that also explains why AMD states that select workloads achieve better performance with NPS = 4. 
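
For NUMA-aware software the takeaway is that, under NPS4, keeping a thread's working set on its own node means every access stays on the nearest two DRAM channels. The sketch below is a hypothetical illustration using libnuma; the 256 MB buffer size, the file name, and the minimal error handling are our assumptions, not AMD guidance.

/*
 * Hypothetical sketch: keep a buffer on the local NUMA node so that,
 * with NPS4, a worker thread only touches its nearest two DRAM channels.
 * Assumed build line: gcc -O2 nps4_local.c -lnuma -o nps4_local
 */
#define _GNU_SOURCE          /* for sched_getcpu() */
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* A Rome socket reports 1 node in NPS1 and 4 nodes in NPS4. */
    printf("configured NUMA nodes: %d\n", numa_num_configured_nodes());

    /* Find the node of the CPU we are running on and allocate there. */
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    size_t size = 256UL * 1024 * 1024;            /* arbitrary 256 MB */
    char *buf = numa_alloc_onnode(size, node);    /* bound to the near DRAM channels */
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, size);  /* first touch faults the pages in on the chosen node */
    printf("CPU %d is on node %d; allocated %zu MB locally\n",
           cpu, node, size / (1024 * 1024));

    numa_free(buf, size);
    return 0;
}

Binding the thread itself (for example with pthread_setaffinity_np or numactl) is the other half of the picture: the allocation alone does not stop the scheduler from migrating the thread to a CPU on a different node.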

Comments

  • Zoolook - Saturday, August 10, 2019 - link

    It's been a pretty good investment for me, bought at $8 two years ago, seems like I'll keep it for a while longer.
  • CheapSushi - Wednesday, August 7, 2019 - link

    It's glorious...one might say.... even EPYC.
  • abufrejoval - Wednesday, August 7, 2019 - link

    Hard to believe a 64-core CPU can be had for the price of a used middle-class car or the price of four RTX 2080 Tis.

    Of course once you add 2TB of RAM and as many PCIe 4 SSDs as those lanes will feed, it no longer feels that affordable.

    There are a lot of clouds still running ancient Sandy/Ivy Bridge and Haswell CPUs: I guess replacing those will eat up quite a lot of chips.

    And to think that it's the very same 8-core part that powers the entire range: that stroke of simplicity and genius took so many years of planning ahead and staying on track during times when AMD was really not doing well. Almost makes you believe that corporations owned by shareholders can actually execute a strategy sometimes, without Facebook-type voting rights.

    Raising my coffee mug in a salute!
  • schujj07 - Thursday, August 8, 2019 - link

    Sandy Bridge maxed out at 8c/16t.
    Ivy Bridge maxed out at 15c/30t.
    Haswell maxed out at 18c/36t.
    That means that a single-socket Epyc (64c/128t) can give you more CPU cores than a quad-socket Sandy Bridge (32c/64t) or Ivy Bridge (60c/120t), and only a few fewer cores than a quad-socket Haswell (72c/144t).
  • Eris_Floralia - Wednesday, August 7, 2019 - link

    This is what we've all been waiting for!
  • Eris_Floralia - Wednesday, August 7, 2019 - link

    Thank you for all the work!
  • quorm - Wednesday, August 7, 2019 - link

    Given the range of configurations and prices here, I don't see much room for Threadripper. Maybe 16-32 cores with higher clock speeds? Really wondering what a new Threadripper can bring to the table.
  • willis936 - Wednesday, August 7, 2019 - link

    A reduced feature set and lower prices, namely.
  • quorm - Wednesday, August 7, 2019 - link

    Reduced in what way, though? I'm assuming Threadripper will be 4 chiplets, 64 PCIe lanes, single socket only. All Ryzen CPUs support ECC.

    So, what can it offer? At 32 cores, 8-channel memory becomes useful for a lot of workloads. Seems like a lot of professionals would just choose Epyc this time. On the other end, I don't think any gamers need more than a 3900X/3950X. Is Threadripper just going to be for bragging rights?
  • quorm - Wednesday, August 7, 2019 - link

    Sorry, forgot to add: the 3950X is $750 and the Epyc 7302P is $825. Where is Threadripper going to fit?
