Memory Subsystem: TinyMemBench

We double-checked our LMBench numbers with Andrei's custom memory latency test.

The latency tool also measures bandwidth, and it became clear that once we move beyond 16 MB, DRAM is accessed. When Andrei compared the results with our Ryzen 9 3900X numbers, he noted:

The prefetchers on the Rome platform don't look nearly as aggressive as on the Ryzen unit on the L2 and L3

It would appear that parts of the prefetchers have been adjusted for Rome compared to Ryzen 3000. In effect, the prefetchers are less aggressive than on the consumer parts, and we believe AMD made this choice because quite a few applications (Java and HPC) suffer somewhat when the prefetchers take up too much bandwidth. Making the prefetchers less aggressive in Rome could aid performance in those workloads.

While we could not retest all our servers with Andrei's memory latency test by the deadline (see the "Murphy's Law" section on page 5), we turned to our open source TinyMemBench benchmark results instead. The source was compiled for x86 with GCC at optimization level "-O3". The measurement is described well by the TinyMemBench manual:

Average time is measured for random memory accesses in the buffers of different sizes. The larger the buffer, the more significant the relative contributions of TLB, L1/L2 cache misses, and DRAM accesses become. All the numbers represent extra time, which needs to be added to L1 cache latency (4 cycles).
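To make the measurement idea concrete, here is a minimal sketch of a random-access latency probe. It is not TinyMemBench's code: it swaps in a pointer-chasing loop over a random cycle, a common way to force dependent random reads, and the function names (`build_random_cycle`, `chase`, `time_per_access_ns`) are hypothetical. Python's interpreter overhead dominates the timings, so the absolute numbers are not comparable to the results discussed here; the sketch only illustrates the access pattern and how the buffer size is varied.

```python
import random
import time
from array import array


def build_random_cycle(n, seed=0):
    """Return a next-index table that forms one cycle over n slots.

    Uses Sattolo's algorithm, so following table[idx] repeatedly visits
    every slot exactly once before returning to the start.
    """
    rng = random.Random(seed)
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, which guarantees a single cycle
        perm[i], perm[j] = perm[j], perm[i]
    return array("q", perm)  # 8-byte signed integers


def chase(table, steps):
    """Follow the chain for `steps` hops; each read depends on the previous one."""
    idx = 0
    for _ in range(steps):
        idx = table[idx]
    return idx


def time_per_access_ns(buffer_bytes, steps=200_000):
    """Average time per dependent random read, in nanoseconds (assumed sketch)."""
    n = max(2, buffer_bytes // 8)  # 8 bytes per table entry
    table = build_random_cycle(n)
    start = time.perf_counter()
    chase(table, steps)
    elapsed = time.perf_counter() - start
    return elapsed * 1e9 / steps
```

Calling `time_per_access_ns` with growing buffer sizes (for example 32 KB, 1 MB, 4 MB) mirrors the sweep described in the manual. A native implementation would be needed to resolve the individual cache levels.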

We tested with dual random read, as we wanted to see how the memory system coped with multiple read requests. 

The graph shows how the larger L3 cache of the EPYC 7742 results in much lower latency between 4 and 16 MB compared to the EPYC 7601. The L3 cache inside the CCX is also very fast (2-8 MB) compared to Intel's mesh (8280) and ring (E5) topologies.

However, once we access more than 16 MB, Intel has a clear advantage thanks to its slower but much larger shared L3 cache. When we tested the new EPYC CPUs in a more advanced NUMA configuration (NPS = 4, meaning 4 nodes per socket), latency at 64 MB dropped from 129 to 119. We quote AMD's engineering:

In NPS4, the NUMA domains are reported to software in such a way as it chiplets always access the near (2 channels) DRAM. In NPS1 the 8ch are hardware-interleaved and there is more latency to get to further ones. It varies by pairs of DRAM channels, with the furthest one being ~20-25ns (depending on the various speeds) further away than the nearest. Generally, the latencies are +~6-8ns, +~8-10ns, +~20-25ns in pairs of channels vs the physically nearest ones.

So that also explains why AMD states that select workloads achieve better performance with NPS = 4. 
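As a rough sanity check on the quoted figures, the sketch below averages the extra latency over the four channel pairs. It rests on two assumptions that the text does not state: that NPS1 interleaving spreads accesses evenly over all four pairs, and that the 129 and 119 figures are in nanoseconds. The variable and function names are hypothetical.

```python
# Extra latency (low, high) in ns for each pair of DRAM channels, relative to
# the physically nearest pair, taken from the AMD quote above.
extra_ns_ranges = [
    (0.0, 0.0),    # nearest pair: the baseline
    (6.0, 8.0),
    (8.0, 10.0),
    (20.0, 25.0),  # furthest pair
]


def mean_penalty(ranges, pick):
    """Average extra latency, assuming accesses hit each pair equally often."""
    return sum(pick(r) for r in ranges) / len(ranges)


low = mean_penalty(extra_ns_ranges, min)    # 8.5 ns
high = mean_penalty(extra_ns_ranges, max)   # 10.75 ns

# Difference between the NPS1 and NPS4 results at 64 MB (unit assumed to be ns).
observed_drop = 129 - 119

print(f"expected NPS1 penalty: {low:.2f} to {high:.2f} ns")
print(f"observed drop: {observed_drop} ns")
```

Under those assumptions the expected average penalty of 8.5 to 10.75 ns brackets the observed difference of 10, which is consistent with AMD's explanation.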


