Memory Subsystem: TinyMemBench

We double-checked our LMBench numbers with Andrei's custom memory latency test.

The latency tool also measures bandwidth, and it became clear that once we move beyond 16 MB, DRAM is accessed. When Andrei compared the results with our Ryzen 9 3900X numbers, he noted:

"The prefetchers on the Rome platform don't look nearly as aggressive as on the Ryzen unit on the L2 and L3."

It would appear that parts of the prefetchers have been tuned differently for Rome compared to Ryzen 3000. In effect, the prefetchers are less aggressive than on the consumer parts, and we believe AMD made this choice because quite a few applications (Java and HPC) suffer a bit if the prefetchers take up too much bandwidth. By making the prefetchers less aggressive in Rome, AMD could aid performance in those workloads.

While we could not retest all our servers with Andrei's memory latency test by the deadline (see the "Murphy's Law" section on page 5), we turned to our results from the open-source TinyMemBench benchmark. The source was compiled for x86 with GCC and the optimization level was set to "-O3". The measurement is described well by TinyMemBench's documentation:

Average time is measured for random memory accesses in the buffers of different sizes. The larger the buffer, the more significant the relative contributions of TLB, L1/L2 cache misses, and DRAM accesses become. All the numbers represent extra time, which needs to be added to L1 cache latency (4 cycles).

We tested with dual random read, as we wanted to see how the memory system coped with multiple read requests. 
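For readers who want a feel for what "dual random read" actually does, the sketch below is our own simplified illustration (it is not the TinyMemBench source, and the buffer size and hop count are arbitrary example values): it walks two pointers through the same randomly shuffled chain, half a lap apart, so each pointer's next load depends on its previous one while the two pointers stay independent of each other. Running it with working sets from a few MB up to a few hundred MB reproduces the cache-to-DRAM transition described above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Shuffle indices 0..n-1 into a single random cycle (Sattolo's algorithm),
 * so that following buf[i] visits every element in an unpredictable order
 * that the hardware prefetchers cannot anticipate. */
static void make_chain(size_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) buf[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % i;                 /* j < i keeps it one cycle */
        size_t tmp = buf[i]; buf[i] = buf[j]; buf[j] = tmp;
    }
}

int main(int argc, char **argv)
{
    /* Working-set size in MB from the command line; 64 MB by default. */
    size_t mb   = (argc > 1) ? strtoul(argv[1], NULL, 10) : 64;
    size_t n    = mb * 1024 * 1024 / sizeof(size_t);
    size_t hops = 50UL * 1000 * 1000;

    size_t *chain = malloc(n * sizeof(size_t));
    if (!chain) return 1;
    make_chain(chain, n);

    /* Two pointers walk the same cycle half a lap apart: each pointer's
     * next load depends on its previous load, but the two pointers are
     * independent, so the memory system can keep two misses in flight. */
    size_t x = 0, y = n / 2;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t h = 0; h < hops; h++) {
        x = chain[x];
        y = chain[y];
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Printing x and y keeps the compiler from optimizing the loop away. */
    printf("%zu MB working set: %.1f ns per dual-read step (x=%zu, y=%zu)\n",
           mb, ns / (double)hops, x, y);

    free(chain);
    return 0;
}
```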

The graph shows how the larger L3 cache of the EPYC 7742 results in much lower latency between 4 and 16 MB compared to the EPYC 7601. The L3 cache inside the CCX is also very fast (2-8 MB) compared to Intel's mesh (8280) and ring (E5) topologies.

However, once we access more than 16 MB, Intel has a clear advantage due to its slower but much larger shared L3 cache. When we tested the new EPYC CPUs in a more advanced NUMA setting (NPS=4, meaning four NUMA nodes per socket), latency at 64 MB dropped from 129 ns to 119 ns. We quote AMD's engineers:

"In NPS4, the NUMA domains are reported to software in such a way that chiplets always access the near (2 channels) DRAM. In NPS1, the 8 channels are hardware-interleaved and there is more latency to get to the further ones. It varies by pairs of DRAM channels, with the furthest one being ~20-25 ns (depending on the various speeds) further away than the nearest. Generally, the latencies are +~6-8 ns, +~8-10 ns, and +~20-25 ns in pairs of channels vs. the physically nearest ones."

This also explains why AMD states that select workloads achieve better performance with NPS=4.
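To make that concrete, NUMA-aware software can exploit NPS4 by pinning threads to a node and allocating their working sets from that node's nearby pair of DRAM channels. The sketch below is our own illustration of that idea using libnuma (it is not AMD's code); node 0 and the 64 MB buffer size are arbitrary example values, and it only does something useful on a system that actually exposes the NPS4 domains.

```c
#include <stdio.h>
#include <numa.h>      /* link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* With NPS1 a dual-socket Rome box reports 2 nodes; with NPS4 it
     * reports 8, one per pair of DRAM channels. */
    int max_node = numa_max_node();
    printf("NUMA nodes visible to software: %d\n", max_node + 1);

    /* Pin this thread to node 0 and allocate its working set there, so the
     * accesses stay on the physically nearest memory channels. */
    numa_run_on_node(0);
    size_t bytes = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(bytes, 0);
    if (!buf) {
        fprintf(stderr, "allocation on node 0 failed\n");
        return 1;
    }

    /* ... run the latency or bandwidth test of choice against buf here ... */

    numa_free(buf, bytes);
    return 0;
}
```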

Comments

  • close - Thursday, August 8, 2019

    VMware licenses per socket. I'm not sure what kind of niche market one would have to be in (maybe HPC on Windows with the HPC Pack?) to run Win server bare metal on this thing. So I'm pretty sure the average cores/VM for Windows servers is relatively low and no reason for concern.
  • schujj07 - Thursday, August 8, 2019

    @deltaFx2 Most people purchase more cores than they currently need so that they can grow. In the long run it is cheaper to purchase a higher SKU right now than purchase a second host a year down the road.
    @close There are companies that are Windows only, so they would install Hyper-V onto this host to use as their hypervisor. However, even under VMware, if you want to license Windows as a VM you have to pay the per-core licensing for every CPU core on each VM. I looked into getting volume licensing for Server 2016 for the company I work for; we have 2 hosts with dual 24-core Epyc 7401s, and we would need to get 16 dual-core license packs for each instance of Server 2016. It ended up that we couldn't afford Server 2016 because it would have cost us $5k per instance.
  • DigitalFreak - Thursday, August 8, 2019

    @schujj07 Just buy a Windows Server Datacenter license for each host and you don't have to worry about licensing each VM.
  • schujj07 - Thursday, August 8, 2019

    AFAIK it doesn't work that way when you are running VMware. With VMware you will still have to license each one.
  • wolrah - Thursday, August 8, 2019

    @schujj07 nope. Windows Server licensing is the same no matter which hypervisor you're using. Datacenter licenses allow unlimited VMs on any licensed host.
  • diehardmacfan - Thursday, August 8, 2019

    This is correct. You do need to buy the licenses to match the core count of the hypervisor, however.
  • Dug - Friday, August 9, 2019

    You still have to pay for cores on Datacenter. Each Datacenter license covers 2 cores, with a minimum purchase of 8, so beyond that you are buying more licenses. 64 cores is about $25k.
  • MDD1963 - Friday, August 9, 2019

    A Windows license (Standard or Datacenter) covers 2 *sockets*, for a total of 16 cores...; if you have more than 2 sockets, you need more licenses...; if you have 2 sockets filled with 8-core CPUs, you are good with one Standard license... If you have 20 total cores, you need a Standard license and a pair of '2 core' add-ons... If you have 32 cores, you need 2 full Standard licenses....
  • MDD1963 - Friday, August 9, 2019

    Datacenter is still licensed for 16 cores, with little 2-core pack increments available; in the case of a 64-core CPU, effectively 4 Datacenter licenses would be required... ($6k per 16 cores, or roughly $24k).
  • deltaFx2 - Friday, August 9, 2019

    @schujj07: Of course I get that. The OP @Pancakes implied that Rome was going to hurt the wallets of buyers using Windows Server, the implication being that this would not happen if they bought Intel. I was questioning those assumptions. How can Rome cost more money for Windows licenses unless Rome needs more cores to get the same job done, or enterprises overprovision Rome (in terms of total cores) vs. Intel? That would make sense if the per-thread performance were worse, but it's not.
