AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome

Name: AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome
Item: AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome
Author: Dr. Ian Cutress

by Dr. Ian Cutress on June 10, 2019 7:22 PM EST

216 Comments | Add A Comment

216 Comments

Cache and Infinity Fabric

If it hasn’t been hammered in already, the big change in the cache is the L1 instruction cache which has been reduced from 64 KB to 32 KB, but the associativity has increased from 4-way to 8-way. This change enabled AMD to increase the size of the micro-op cache from 2K entry to 4K entry, and AMD felt that this gave a better performance balance with how modern workloads are evolving.

The L1-D cache is still 32KB 8-way, while the L2 cache is still 512KB 8-way. The L3 cache, which is a non-inclusive cache (compared to the L2 inclusive cache), has now doubled in size to 16 MB per core complex, up from 8 MB. AMD manages its L3 by sharing a 16MB block per CCX, rather than enabling access to any L3 from any core.

Because of the increase in size of the L3, latency has increased slightly. L1 is still 4-cycle, L2 is still 12-cycle, but L3 has increased from ~35 cycle to ~40 cycle (this is a characteristic of larger caches, they end up being slightly slower latency; it’s an interesting trade off to measure). AMD has stated that it has increased the size of the queues handling L1 and L2 misses, although hasn’t elaborated as to how big they now are.

Infinity Fabric

With the move to Zen 2, we also move to the second generation of Infinity Fabric. One of the major updates with IF2 is the support of PCIe 4.0, and thus the increase of the bus width from 256-bit to 512-bit.

Overall efficiency of IF2 has improved 27% according to AMD, leading to a lower power per bit. As we move to more IF links in EPYC, this will become very important as data is transferred from chiplet to IO die.

One of the features of IF2 is that the clock has been decoupled from the main DRAM clock. In Zen and Zen+, the IF frequency was coupled to the DRAM frequency, which led to some interesting scenarios where the memory could go a lot faster but the limitations in the IF meant that they were both limited by the lock-step nature of the clock. For Zen 2, AMD has introduced ratios to the IF2, enabling a 1:1 normal ratio or a 2:1 ratio that reduces the IF2 clock in half.

This ratio should automatically come into play around DDR4-3600 or DDR4-3800, but it does mean that IF2 clock does reduce in half, which has a knock on effect with respect to bandwidth. It should be noted that even if the DRAM frequency is high, having a slower IF frequency will likely limit the raw performance gain from that faster memory. AMD recommends keeping the ratio at a 1:1 around DDR4-3600, and instead optimizing sub-timings at that speed.

Integer Units, Load and Store Conclusions: Platform, SoC, Core

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

216 Comments

View All Comments

eek2121 - Wednesday, June 19, 2019 - link
I think what people are getting at is having an L4 Cache. Such a cache would be slower than L3, but would be much faster than DRAM (for now, DDR 5133 was recently demonstrated, that is 2566 MHz double data rate). HBM2 is a prime candidate for that because you can stick 8 Gb on a CPU for $60 and with some engineering work, it would help performance massively. 8 gb could hold practically everything needed in cache. That being said, there are engineering challenges to overcome and I doubt this will ever be a thing.

Once JEDEC approves RAM running at DDR 5600 at reasonable timings it won’t matter anyway. AMD can simply bump up the IF speed to 1:1 and with shortened RAM traces, performance penalties can be minimized.
jamescox - Saturday, June 22, 2019 - link
For an interposer based Epyc package for the next generation, I would expect perhaps they do an active interposer with all of the external interface transistors in the interposer. They could do similar things with a passive interposer also though. The passive interposer could be an intermediate between Zen 3 and Zen 4. Then they could place a large number of 7 nm+ chiplets on the interposer. As I said, it is hard to speculate, but an option that I thought of based on AdoredTV 15 chiplet rumor would be to have 4 memory controller chips, each one running 2 channels (128-bit) DDR5. Those chips would just be the memory controller logic if on an active interposer and the interfaces to the interposer connections. That isn’t much so at 7 nm and below, they could place massive L4 SRAM caches on the memory controller chips. Current ~75 square mm Zen 2 chiplets have 16 MB plus 8 cpu cores, so it could be a large amount of cache; perhaps something like 64 or 128 MB per chip. It wouldn’t be a cheap device, but AMD’s goal is to get into the high end market eventually.

The other chiplets could be 1 or two die to manage connections out to the cpu chiplets. This would just be the logic with an active interposer. With a regular interposer, it would need to have the IO transistors also, but the interfaces are quite small. A single infinity fabric switch chip handling all cpu chiplets could provide very low latency. They may have another chip with a switch to tie everything together or they could actually place a couple cpu chiplets on the interposer. Two extra cpu chiplets or one 16 core chiplet could be where the 80 core rumor came from. A possible reason to do that is to allow an HBM based gpu to be mounted on either side. That would make an exceptional HPC product with 16 cores (possible 64 threads if they go to 4 way SMT) and 2 HBM gpus. Another way to get 80 core would be to just make a 3 CCX chiplet with 12 cores. It looks like the Epyc package will not fit all 12 core die though. A mixture of 4 12-core and 4 8-core looks like it would fit, but it wouldn’t be symmetric though. That would allow a quick Zen 2+ style upgrade. Desktop might be able to go to 24 cores and Epyc to 80. The confusion could be mixing up a Zen 2+ rumor and a Zen 3 rumor or something like that. The interposer makes a lot of sense for the giant IO die that cannot be easily implemented at 7 nm. The yields probably don’t support that large of die, so you use an interposer and make a bunch of 100 square mm sized die instead.

I can’t rule out placing HBM on an IO interposer, but due to the latency not really being that much better than off package DRAM, especially at DDR5 speeds, it just doesn’t seem like they would do it.
nandnandnand - Sunday, July 7, 2019 - link
"That being said, there are engineering challenges to overcome and I doubt this will ever be a thing."

Putting large amounts of DRAM ever closer to the CPU will definitely be a thing:

https://www.darpa.mil/attachments/3DSoCProposersDa...

Intel is already moving in this direction with Foveros, and AMD is also working on it:

https://www.tomshardware.com/news/amd-3d-memory-st...

It doesn't matter how fast DDR5 is. The industry must move in this direction to grab performance and power efficiency gains.
AdrianMel - Sunday, June 16, 2019 - link
I would like these AMD chips to be used on laptops. It would be a breakthrough in computing power, low consumption. I think that if a HBM2 memory or a larger memory is integrated into the processor, I think it will double the computing power. It would be a study and implementation of 2 super ports, the old expresscard 54 in which we can insert 2 video cards in laptops
nandnandnand - Sunday, July 7, 2019 - link
AMD needs to put out some 6-8 core Zen 2 laptop chips.
peevee - Monday, June 17, 2019 - link
Does it mean that AVX2 performance doubles compared to Zen+? At least on workloads where data for the inner loop fits into L1D$ (hierarchical dense matrix multiplication etc)?
peevee - Monday, June 17, 2019 - link
"AMD manages its L3 by sharing a 16MB block per CCX, rather than enabling access to any L3 from any core."

Does it mean that for code and shared data caches, 64MB L3 on Ryzen 9 behaves essentially like 16MB cache (say, all 12/16 cores run the same code as it usually is in performance-critical client code and not 4+ different processes/VMs in parallel)? What a waste it is/would be...
jamescox - Saturday, June 22, 2019 - link
The caches on different CCXs can communicate with each other. In Zen 2, those one the same die probably communicate at core clock rather than at memory clock; there is no memory clock on the cpu chiplet. The speeds between chiplets have essentially more than doubled the clocks vs. Zen 1 and there is a possibility that they doubled the widths also. There just about isn’t any way to scale to such core counts otherwise.

An intel monolithic high core count device will have trouble competing. The latency of their mesh network will go up with more cores and it will burn a lot of power. The latency of the L3 with a mesh network will be higher than the latency within a 4-core CCX. Problems with the CCX architecture are mostly due to OS scheduler issues and badly written multithreaded code. Many applications performed significantly better on Linux compared to windows due to this.

The mesh network is also not workable across multiple chiplets. A 16-core (or even a 10 core) monolithic device would be quite large for 10 nm. They would be wasting a bunch of expensive 10 nm capacity on IO. With the large die size and questionable yields, it will be a much more expensive chip than AMD’s MCM. Also, current Intel chips top out at 38.5 MB of L3 cache on 14 nm. Those are mostly expensive Xeon processors. AMD will have a 32 MB part for $200 and a 64 MB part for $500. Even when Intel actually gets a 10 nm part on the desktop, it will likely be much more expensive. They are also going to have serious problems getting their 10 nm parts up to competitive clock speeds with the 14 nm parts. They have been tweaking 14 nm for something like 5+ years now. Pushing the clock on their problematic 10 nm process doesn’t sound promising.
peevee - Monday, June 17, 2019 - link
"One of the features of IF2 is that the clock has been decoupled from the main DRAM clock....

For Zen 2, AMD has introduced ratios to the IF2, enabling a 1:1 normal ratio or a 2:1 ratio that reduces the IF2 clock in half."

I have news for you - 2:1 is still COUPLED. False advertisement in the slides.

And besides, who in their right mind would want to halve IF clock to go from DDR3200 to even DDR4000 (with requisite higher timings)?
BMNify - Saturday, June 22, 2019 - link
the only real world test that matters in the UHD2/8K Rec. 2020/BT.2020 LIVE NHK/bbc broadast of the 2020 Summer Olympics will begin on Friday, 24 July and related video streams is can AMD Zen 2 do it can any pc core do realtime x264/x265/ffmpeg software encoding and x264/x265 compliant decoding (notice how many hw assisted encoders today dont decode to spec as seen when you re-enode them with the latest ffmpeg), how many 8k encodes and what overheads are remaining if any can even do one...

AMD Zen 2 Microarchitecture Analysis: Ryzen 3000 and EPYC Rome

Cache and Infinity Fabric

Infinity Fabric

Post Your Comment

216 Comments

View All Comments

eek2121 - Wednesday, June 19, 2019 - link

jamescox - Saturday, June 22, 2019 - link

nandnandnand - Sunday, July 7, 2019 - link

AdrianMel - Sunday, June 16, 2019 - link

nandnandnand - Sunday, July 7, 2019 - link

peevee - Monday, June 17, 2019 - link

peevee - Monday, June 17, 2019 - link

jamescox - Saturday, June 22, 2019 - link

peevee - Monday, June 17, 2019 - link

BMNify - Saturday, June 22, 2019 - link

Log in

Don't have an account? Sign up now