Memory Subsystem: TinyMemBench

We double-checked our LMBench numbers with Andrei's custom memory latency test.

The latency tool also measures bandwidth, and it became clear that once we move beyond 16 MB, DRAM is accessed. When Andrei compared these results with our Ryzen 9 3900X numbers, he noted:

The prefetchers on the Rome platform don't look nearly as aggressive as on the Ryzen unit on the L2 and L3

It would appear that parts of the prefetchers have been tuned differently for Rome than for Ryzen 3000. In effect, the prefetchers are less aggressive than on the consumer parts. We believe AMD made this choice because quite a few applications (Java and HPC workloads) suffer somewhat when the prefetchers consume too much bandwidth, so making them less aggressive in Rome could help performance in those tests.

While we could not retest all our servers with Andrei's memory latency test by the deadline (see the "Murphy's Law" section on page 5), we turned to our results from the open-source TinyMemBench benchmark. The source was compiled for x86 with GCC at optimization level "-O3". The measurement is described well by the TinyMemBench manual:

Average time is measured for random memory accesses in the buffers of different sizes. The larger the buffer, the more significant the relative contributions of TLB, L1/L2 cache misses, and DRAM accesses become. All the numbers represent extra time, which needs to be added to L1 cache latency (4 cycles).
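Because the reported figures are "extra time" on top of an L1 hit, turning one into an approximate total load latency takes a small conversion. The sketch below assumes the reported values are in nanoseconds. The function name and the 2.0 GHz core clock are hypothetical and chosen only for illustration.

```python
def total_latency_ns(extra_ns, core_clock_ghz, l1_cycles=4):
    """Approximate total latency: L1 hit time plus the reported extra time.

    core_clock_ghz is an assumed clock speed; l1_cycles defaults to the
    4-cycle L1 latency mentioned in the TinyMemBench description.
    """
    l1_ns = l1_cycles / core_clock_ghz  # GHz equals cycles per nanosecond
    return l1_ns + extra_ns


# Hypothetical input: 119 ns of extra time at an assumed 2.0 GHz clock
print(total_latency_ns(119.0, 2.0))
```

At DRAM-sized buffers the L1 term is only a couple of nanoseconds, so the extra time dominates the total.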

We tested with dual random read, as we wanted to see how the memory system coped with multiple read requests. 

The graph shows how the larger L3 cache of the EPYC 7742 results in much lower latency between 4 and 16 MB compared to the EPYC 7601. The L3 cache inside the CCX is also very fast (2-8 MB) compared to Intel's mesh (8280) and ring (E5) topologies.
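As a rough mental model for reading such a latency curve, the sketch below maps a buffer size to the first level of the hierarchy that can hold it. The capacities are assumptions based on the usual Zen 2 figures (32 KB L1D and 512 KB L2 per core, 16 MB L3 per CCX), and the names are illustrative. Real curves transition gradually near each boundary rather than switching sharply.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumed capacities visible to one core on a Zen 2 part (illustrative)
ZEN2_LEVELS = [("L1D", 32 * KIB), ("L2", 512 * KIB), ("L3", 16 * MIB)]


def serving_level(buffer_bytes, levels=ZEN2_LEVELS):
    """Return the first level whose capacity covers the buffer, else DRAM."""
    for name, capacity in levels:
        if buffer_bytes <= capacity:
            return name
    return "DRAM"


for size_mib in (1, 8, 16, 64):
    print(size_mib, "MB ->", serving_level(size_mib * MIB))
```

Replacing the L3 entry with 8 MB gives a simplified picture of why a part with a smaller per-CCX L3 would already be going to DRAM in the 8 to 16 MB range.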

However, once we access more than 16 MB, Intel has a clear advantage thanks to its slower but much larger shared L3 cache. When we tested the new EPYC CPUs in a more advanced NUMA configuration (NPS = 4, meaning 4 nodes per socket), latency at 64 MB dropped from 129 to 119 ns. We quote AMD's engineering:

In NPS4, the NUMA domains are reported to software in such a way as it chiplets always access the near (2 channels) DRAM. In NPS1 the 8ch are hardware-interleaved and there is more latency to get to further ones. It varies by pairs of DRAM channels, with the furthest one being ~20-25ns (depending on the various speeds) further away than the nearest. Generally, the latencies are +~6-8ns, +~8-10ns, +~20-25ns in pairs of channels vs the physically nearest ones.

So that also explains why AMD states that select workloads achieve better performance with NPS = 4. 
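The per-pair figures in AMD's quote allow a quick sanity check. The sketch below assumes that NPS1 interleaving spreads accesses evenly across the four channel pairs and that NPS4 pays only the nearest-pair latency. Both are simplifications, and the names are illustrative.

```python
# Extra latency range (ns) of each channel pair relative to the nearest pair,
# taken from the AMD quote: nearest, +6-8, +8-10, +20-25
PAIR_PENALTY_NS = [(0, 0), (6, 8), (8, 10), (20, 25)]


def mean_interleave_penalty(pairs=PAIR_PENALTY_NS):
    """Average extra latency if accesses hit every channel pair equally often."""
    low = sum(lo for lo, _ in pairs) / len(pairs)
    high = sum(hi for _, hi in pairs) / len(pairs)
    return low, high


low, high = mean_interleave_penalty()
print(f"expected NPS1 penalty: {low} to {high} ns")
```

Under these assumptions the average penalty is 8.5 to 10.75 ns. The measured improvement at 64 MB (129 down to 119) falls inside that range, provided the TinyMemBench figures are in nanoseconds like AMD's.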


