Cache Hierarchy Changes: Double L3, Faster Memory

Among the biggest changes in Ryzen 3000, alongside the improved core microarchitecture, is the chip’s overall cache hierarchy. The new chiplet houses CCXes with double the amount of L3 cache: 16MB instead of 8MB.

Furthermore, the chiplet design and the introduction of the cIO die, which houses the new memory controllers, will undoubtedly have an impact on the memory latency and performance of the overall chip.

On the memory controller side in particular, AMD promises a wholly revamped design that brings support for much faster DDR4 modules: the chip is by default rated for DDR4-3200, a bump over the DDR4-2933 support of the Ryzen 2000 series.

AMD published an interesting slide regarding the new faster DDR4 support that goes well above the officially supported 3200 speeds, claiming that the new controllers are able to handle up to DDR4-4200 with ease, with overclocking making even higher speeds possible. However, there’s a catch: in order to support DDR4 above 3600, the chip will automatically change the memory-controller-to-Infinity-Fabric clock ratio from 1:1 to 2:1.

Whilst this doesn’t bottleneck the bandwidth from memory to the cores, as the new microarchitecture has doubled the bus width of the Infinity Fabric to 512 bits, it does add a notable number of cycles to the overall memory latency. This means that for the vast majority of workloads, you’re better off staying at or under DDR4-3600 with a 1:1 MC:IF ratio. It’s to be noted that it’s still possible to maintain this 1:1 ratio at higher memory-controller speeds by adjusting it manually, however system stability is no longer guaranteed, as in such a scenario you’re effectively overclocking the Infinity Fabric as well.
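As a rough sketch of the clocking behaviour described above, the relationship between DRAM transfer rate and the controller clock can be modelled as follows. This is a simplified illustration, not AMD documentation: the function name is ours, and the exact firmware behaviour (including FCLK limits) varies by board and BIOS.

```python
def zen2_clocks(ddr_speed: int) -> dict:
    """Simplified model of Zen 2's automatic clock selection.

    DDR4 transfers data twice per clock, so the memory clock (MEMCLK)
    is half the DDR transfer rate. Above DDR4-3600, the memory
    controller clock (UCLK) drops to half of MEMCLK -- the 2:1 mode
    described in the article, which adds latency cycles.
    """
    memclk = ddr_speed // 2           # e.g. DDR4-3200 -> 1600 MHz
    if ddr_speed <= 3600:
        uclk = memclk                 # 1:1 mode, lowest latency
    else:
        uclk = memclk // 2            # 2:1 mode
    return {"MEMCLK": memclk, "UCLK": uclk}

print(zen2_clocks(3200))  # {'MEMCLK': 1600, 'UCLK': 1600}
print(zen2_clocks(4000))  # {'MEMCLK': 2000, 'UCLK': 1000}
```

This illustrates why DDR4-4000 in automatic mode can end up with worse latency than DDR4-3600: the controller itself runs at only 1000MHz.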

For this article we didn’t have enough time to dive into the scaling behaviour of different DRAM speeds; what we did investigate is a more architectural question: how exactly has the new chiplet and cIO die architecture impacted Zen2’s memory latency and memory performance?

To give better insights, we’re using my custom memory latency test that I use for mobile SoC testing, first covered in our review of the Galaxy S10+ and its two SoCs. Memory latency testing nowadays is a complicated topic as microarchitectures advance at a rapid rate, and prefetchers in particular can sometimes produce misleading figures. Similarly, more brute-force approaches such as fully random tests contain a lot of TLB miss latencies which don’t represent the actual structural latency of the system. Our custom latency suite thus isn’t a single one-number-fits-all test but rather a collection of tests that expose more details of the memory behaviour of the system.
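The core technique behind tests like these is pointer chasing: each load’s address depends on the previous load’s result, so the hardware cannot overlap the accesses. The sketch below only illustrates the idea; the actual suite is custom native code, and Python’s interpreter overhead completely swamps real cache latencies, so the numbers it prints are not comparable to anything on this page.

```python
import random
import time

def make_chain(n_elems: int) -> list:
    """Build a random cyclic permutation: arr[i] holds the index of
    the next element to visit, forming one cycle over all elements.
    Each load depends on the previous one, defeating simple overlap."""
    idx = list(range(n_elems))
    random.shuffle(idx)
    arr = [0] * n_elems
    for a, b in zip(idx, idx[1:] + idx[:1]):
        arr[a] = b
    return arr

def chase(arr: list, steps: int) -> float:
    """Follow the chain for `steps` loads; return average ns/access."""
    pos = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        pos = arr[pos]
    return (time.perf_counter() - t0) / steps * 1e9

arr = make_chain(1 << 16)
print(f"{chase(arr, 1 << 18):.1f} ns/access (dominated by interpreter overhead)")
```

Varying the buffer size while keeping everything else constant is what produces the latency-versus-depth curves in the graphs below: once the working set exceeds a cache level, the average access time jumps.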

The figures published on this page are run on DDR4-3200CL16 on the Ryzen 3900X and 2700X at timings of 16-16-16-36, and the i9-9900K was run with similar DDR4-3200CL16 at timings of 16-18-18-36.

[Graph: memory latency curves, linear scale]

Looking at the memory latency curves in a linearly plotted graph, we see some obvious large differences between the new Ryzen 3900X and the Ryzen 2700X. What immediately catches the eye when switching between the two results is the new 16MB L3 cache capacity, double the 8MB of Zen. We have to remind ourselves that even though the whole chip contains 64MB of L3 cache, this is not a unified cache: a single CPU core will only see its own CCX’s L3 before going out to main memory, in contrast to Intel’s L3 cache, where all cores have access to the full amount.

Before going into more detail with the next graph, another thing that is obvious is that the 3900X’s DRAM latency seems a tad worse than the 2700X’s. Among the many test patterns here, the one to note is the “Structural Estimate” curve. This curve is a simple subtraction of the TLB+CLR Thrash test minus the TLB Penalty figure. In the former, we cause as much cache-line replacement pressure as possible by repeatedly hitting the same cache-line within each memory page, while also repeatedly missing the TLB. In the latter, we still hit the TLB heavily, but always use a different cache-line and thus have a minimum of cache-line pressure, resulting in an estimate of the TLB penalty. Subtracting the latter from the former gives us quite a good estimate of the actual structural latency of the chip and memory.
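In other words, the estimate is a straightforward subtraction. With purely hypothetical figures (not measured values from our runs):

```python
# Hypothetical example values in ns, purely to illustrate the subtraction.
tlb_clr_thrash = 110.0   # heavy TLB misses + maximal cache-line replacement
tlb_penalty    = 35.0    # heavy TLB misses, minimal cache-line pressure

# The TLB cost appears in both measurements, so subtracting cancels it
# out, leaving an estimate of the true structural memory latency.
structural_estimate = tlb_clr_thrash - tlb_penalty
print(structural_estimate)  # 75.0
```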

Now the big question is: why do it this way? I’ve found that with increasingly better prefetchers, it’s getting difficult to obtain good memory latency numbers. Whilst it’s possible to outright disable prefetchers on some platforms, that avenue isn’t always available.

Indeed, looking at the other various patterns in the graph, we see quite a large difference between the 3900X and the 2700X, with the 3900X showcasing notably lower latencies in a few of them. These figures are a result of Zen2’s improved prefetchers, which are able to better recognize patterns and pull data out of DRAM before the CPU core actually accesses that memory address.

[Graph: memory latency curves, logarithmic scale]

Plotting the same data on a logarithmic graph, we can better see some of the details.

In terms of DRAM latency, it seems that the new Ryzen 3900X has regressed by around 10ns compared to the 2700X, with ~74-75.5ns versus ~65.7ns (note: take the leading edge of the “Structural Estimate” curves as the better estimate).

It also looks like Zen2’s L3 cache has gained a few cycles: a change from ~7.5ns at 4.3GHz to ~8.1ns at 4.6GHz would mean a regression from ~32 cycles to ~37 cycles. Such a change, however, was to be expected, since doubling the L3 cache structure has to come with some implementation compromises; there’s never a free lunch. Zen2’s L3 cache latency is thus now about the same as Intel’s, while it was previously faster on Zen+.
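The conversion between nanoseconds and cycles is simply latency multiplied by clock frequency; checking the figures above:

```python
def ns_to_cycles(latency_ns: float, freq_ghz: float) -> int:
    # GHz is cycles per nanosecond, so cycles = ns * GHz.
    return round(latency_ns * freq_ghz)

print(ns_to_cycles(7.5, 4.3))  # Zen+ L3 at 4.3GHz: ~32 cycles
print(ns_to_cycles(8.1, 4.6))  # Zen 2 L3 at 4.6GHz: ~37 cycles
```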

Another interesting characteristic we see here is the increased capacity of the L2 TLB. This can be seen in the “TLB Penalty” curve, and the depth here corresponds to AMD’s published details of increasing the structure from 1536 pages to 2048 pages. It’s to be noted that the L3 capacity now exceeds the coverage of the TLB, meaning a single CPU core will only have the best access latencies for up to 8MB of the cache before having to page-walk. We see similar behaviour in the L2 cache, where the L1 TLB coverage spans only 256KB of the cache before entries have to be looked up in the L2 TLB.
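TLB reach is simply entries multiplied by page size, which is why the capacities line up the way they do. The sketch below assumes standard 4KB x86-64 pages; the 64-entry L1 DTLB figure is inferred from the 256KB coverage noted above rather than stated in this article.

```python
PAGE = 4096  # bytes, standard x86-64 page size

l2_tlb_entries = 2048   # Zen 2 L2 DTLB, per AMD's published details
l1_tlb_entries = 64     # assumed from the 256KB coverage observed above

print(l2_tlb_entries * PAGE // 2**20, "MB")   # 8 MB -> half the 16MB L3
print(l1_tlb_entries * PAGE // 2**10, "KB")   # 256 KB -> half the 512KB L2
```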

Another very interesting characteristic of AMD’s microarchitecture, contrasting with Intel’s, is the fact that AMD prefetches all patterns into the L2 cache, while Intel only does so for the nearest cache-line. Such behaviour is a double-edged sword: on one hand, AMD’s cores can have better latencies to needed data, but on the other hand, in the case of an unneeded prefetch, it puts a lot more pressure on the L2 cache capacity, and could in effect counteract some of the benefit of having double the capacity of Intel’s design.

[Graph: cache and memory bandwidth]

Switching over to the memory bandwidth of the cache hierarchy, there’s one obvious new change in the 3900X and Zen2: the inclusion of 256-bit wide datapaths. The new AGU and datapath changes mean that the core is now able to handle a 256-bit AVX instruction per cycle, a doubling over the 128-bit datapaths of Zen and Zen+.

So while the bandwidth of 256-bit operations on the Ryzen 2700X looked identical to the 128-bit variants, the wider ops on Zen2 now effectively double the bandwidth of the core. This bandwidth doubling is evident in the L1 cache (the flip test is equivalent to a memory copy test), however the increase is only about 20% for the L2 and L3 caches.

There’s an interesting juxtaposition between AMD’s L3 cache bandwidth and Intel’s: AMD essentially has a 60% advantage in bandwidth, as the CCX’s L3 is much faster than Intel’s L3 when accessed by a single core. Particularly read-write modifications within a single cache-line (CLflip test) are significantly faster in both the L2 and L3 caches when compared to Intel’s core design.

Deeper into the DRAM region, however, we see that AMD is still lagging behind Intel when it comes to memory controller efficiency: while the 3900X improves copy bandwidth from 19.2GB/s to 21GB/s, it still remains behind the 9900K’s 22.9GB/s. The store bandwidth (write bandwidth) to memory is also a tad lower on the AMD parts, with the 3900X reaching 14.5GB/s versus Intel’s 18GB/s.
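A copy-bandwidth figure of this kind boils down to timing a bulk copy of a buffer much larger than the caches. The minimal Python sketch below illustrates the method only; our actual tests are native code with vectorized loads and stores, so these numbers aren’t directly comparable to the figures above.

```python
import time

def copy_bandwidth_gbps(size_bytes: int, iters: int = 8) -> float:
    """Time repeated copies of a buffer sized well beyond the caches,
    so the traffic is serviced by DRAM. Reports copied bytes per
    second; each copy also writes the data, so bus traffic is ~2x."""
    src = bytearray(size_bytes)
    t0 = time.perf_counter()
    for _ in range(iters):
        dst = bytes(src)      # one bulk copy per iteration
    elapsed = time.perf_counter() - t0
    return size_bytes * iters / elapsed / 1e9

print(f"{copy_bandwidth_gbps(256 * 2**20):.1f} GB/s")
```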

[Graph: memory-level parallelism speedup]

One aspect in which AMD excels is memory-level parallelism. MLP is the ability of the CPU core to keep multiple outstanding memory accesses in flight when they miss the caches, waiting on them to return later. In the above graph we see an increasing number of concurrent random memory accesses depicted as the stacked lines, with the vertical axis showing the effective access speedup relative to a single access.

Both AMD’s and Intel’s MLP ability in the L2 region is about the same, reaching a 12x speedup; this is because we’re saturating the bandwidth of the cache in this region and simply can’t go any faster with more accesses. In the L3 region, however, we see big differences between the two: while Intel starts off with around 20 accesses at the L3 with a 14-15x speedup, the TLBs and supporting core structures aren’t able to sustain this over the whole L3, as the core has to access other L3 slices on the chip.

AMD’s implementation however seems to be able to handle over 32 accesses with an extremely robust 23x speedup. This advantage actually continues on to the DRAM region where we still see speed-ups up to 32 accesses, while Intel peaks at 16.
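An MLP test works by running several independent pointer chains side by side: within one loop iteration the chains don’t depend on each other, so an out-of-order core can have all of their cache misses in flight at once. The sketch below only constructs such an access pattern; Python itself cannot expose hardware MLP, since the interpreter serializes every access, so a native implementation is needed for real measurements.

```python
import random

def make_chains(n_elems: int, n_chains: int) -> list:
    """Split one buffer's indices into `n_chains` independent random
    chains. A native test advances all chains in a single loop body;
    with k independent misses in flight, an ideal machine completes
    k accesses in the time of one (the speedup on the vertical axis)."""
    chains = []
    for c in range(n_chains):
        # Each chain owns every n_chains-th slot of the buffer,
        # so the chains never touch the same element.
        idx = list(range(c, n_elems, n_chains))
        random.shuffle(idx)
        chains.append(idx)
    return chains

chains = make_chains(1 << 12, 8)
print(len(chains), len(chains[0]))  # 8 chains of 512 indices each
```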

MLP ability is extremely important in order to be able to actually hide the various memory hierarchy latencies and to take full advantage of a CPU’s out-of-order execution abilities. AMD’s Zen cores here have seemingly the best microarchitecture in this regard, with only Apple’s mobile CPU cores having comparable characteristics. I think this was very much a conscious design choice of the microarchitecture as AMD knew their overall SoC design and future chiplet architecture would have to deal with higher latencies, and did their best in order to minimise such a disadvantage.

So while the new Zen2 cores seemingly have worse latencies, possibly a combined effect of a faster memory controller (higher frequencies could have come at a cost of latency in the implementation) and a larger L3 with additional cycles, it doesn’t mean that memory-sensitive workloads will see much of a regression. AMD has been able to improve the core’s prefetchers, and average workload latency will be lower thanks to the doubled L3. This comes on top of the core’s microarchitecture, which has outstandingly good MLP ability for whenever there is a cache miss, something to keep in mind as we investigate performance further.


452 Comments


  • Death666Angel - Tuesday, July 9, 2019 - link

    Well, the thing is that motherboard manufacturers, motherboard revisions, motherboard layout and BIOS versions do play a role as well, though. The memory controller is just one piece of the puzzle. If you have a CPU with a great memory controller, it doesn't mean it performs the same on all boards. And it doesn't mean it performs the same with all RAM either. Sometimes the actual traces on motherboards are crap for certain clockspeeds. Sometimes the BIOS numbers for secondary and tertiary timings are crap at certain clockspeeds and get better in later revisions, seemingly allowing for better memory clockspeeds when it really was just a question of auto vs manual if you knew what you were doing. Sometimes the SoC voltage is worse on that board vs the other and that influences things. The thing is, across the board, X570 motherboards have higher advertised OC clockspeeds for the memory and Ryzen 3000 has higher guaranteed clockspeeds. And Anandtech believes that is the thing that counts, not if you can get x clockspeed stable. At least in the vanilla CPU articles. They do separate RAM articles often.
  • BLu3HaZe - Tuesday, July 9, 2019 - link

    "Some motherboard vendors are advertising speeds of up to DDR4-4400 which until Zen 2, was unheard of. Zen 2 also marks a jump up to DDR4-3200 up from DDR4-2933 on Zen+, and DDR4-2667 on Zen."

    How about now? :)

    And I believe the authors mean to say that official support is for up to 3200 on X570 boards, while older boards were rated lower "officially" corresponding to the generation they launched with. Speeds above that would be listed with (OC) clearly marked in memory support.
    Anything above the 'rated' speeds, you're technically overclocking the Infinity Fabric until you run in 2:1 mode which is only on Zen 2 anyhow, so your mileage will definitely vary.

    Even the 9900K 'officially' supports only DDR4-2666 but we all know how high it can go without any issues combined with XMP OC.
  • Ratman6161 - Wednesday, July 10, 2019 - link

    In Zen and Zen+, the infinity fabric speed was tied to the memory speed. So overclock the RAM and you were also overclocking the infinity fabric. In Zen 2 the infinity fabric is independent of the RAM speed.
  • Targon - Monday, July 8, 2019 - link

    I am curious about the DDR4-3200 CL16 memory in the Ryzen test. CL16 RAM is considered the "cheap crap" when it comes to DDR4-3200, and my own Hynix M-die garbage memory is exactly that, G.skill Ripjaws V 3200CL16. On first generation Ryzen, getting it to 3200 speeds just hasn't happened, and I know that for gaming, CL16 vs. CL14 is enough to cause the slight loss to Intel (meaning Intel wouldn't have the lead in the gaming tests).
  • Ninjawithagun - Monday, July 8, 2019 - link

    Regardless of whether you have a 'crap' DRAM kit with CL16 or a much more expensive kit with a lower CL rating, it isn't going to make any significant difference in performance. This has been proven again and again.
  • Ratman6161 - Wednesday, July 10, 2019 - link

    "CL16 RAM is considered the "cheap crap" when it comes to DDR4-3200"

    Since when? Yes it's cheap(er) but I'd disagree with the "crap" part. I needed 32 GB of RAM so that's either 2x16 with 16 GB modules usually being double sided (a crap shoot) or 4x8 with 4 modules being a crap shoot. Looking at current pricing (not the much higher prices from back when I bought), Newegg has the G.Skill Ripjaws 2x16 CAS 16 kit for $135 while the Trident Z 2x16 CAS 15 goes for $210 and the CAS 14 Trident Z for $250. So I'd be paying $75 to $115 more... for something that isn't likely to do any better in my real world configuration. Even if I could hit its advertised CAS 15 or 14, how much is that worth? So I'd say the Ripjaws is not "cheap crap". It's a "value" :)
  • Domaldel - Wednesday, July 10, 2019 - link

    It's considered "cheap crap" because you can't guarantee that it's Samsung B-die at those speeds, while you can with DDR4-3200 CL14, as nothing else is able to reach those speeds and latencies other than a good B-die.
    What that means is that you can actually have a shot at manually overclocking it further while keeping compatibility with Ryzen (if you tweak the timings and sub-timings), while you couldn't really with other memory kits on the first two generations of Ryzen.
    I don't have a Ryzen 3xxx series of chip so I can't really comment on those...
  • WaltC - Monday, July 15, 2019 - link

    Since about the 2nd AGESA implementation, on my original X370 Ryzen 1 motherboard, my "cheap crap"...;)...Patriot Viper Elite 16CL 2x8GB has had no problem with 3200MHz at stock timings. I used the same on an X470 board, and now it's running at 3200MHz on my X570 Aorus Master board, no problems.
  • jgraham11 - Tuesday, July 16, 2019 - link

    DDR4 3200 is apparently not an overclock. Says so on AMD's specs page for the 3700X

    https://www.amd.com/en/products/cpu/amd-ryzen-7-37...
