Zen 4 Execution Pipeline: Familiar Pipes With More Caching

Finally, let’s take a look at the Zen 4 microarchitecture’s execution flow in depth. As we noted before, AMD is seeing a 13% IPC improvement over Zen 3. So how did they do it?

Throughout the Zen 4 architecture, there is no single radical change. Zen 4 does make a few notable changes, but the basics of the instruction flow are unchanged, especially in the back-end execution pipelines. Rather, many (if not most) of the IPC improvements in Zen 4 come from enlarging caches and buffers in one respect or another.

Starting with the front end, AMD has made a few important improvements. The branch predictor, a common target for improvements given the payoffs of correct predictions, has been further iterated upon for Zen 4. While still predicting 2 branches per cycle (the same as Zen 3), AMD has increased the L1 Branch Target Buffer (BTB) cache size by 50%, to 2 x 1.5k entries. Similarly, the L2 BTB has been increased to 2 x 7k entries (though this is just an ~8% capacity increase). The net result is that the branch predictor’s accuracy improves, since it can track a larger set of branch targets.
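
To put those figures in context, here is a quick, hypothetical C sketch of the kind of code that stresses BTB capacity. Everything in it, from the macro names to the branch count, is an illustrative assumption on our part rather than an AMD-published test, but the idea is simple: once a hot loop contains more distinct branch instructions than the L1 BTB has entries, target lookups spill into the slower L2 BTB.

    /* Hypothetical BTB-pressure sketch: each if-statement below compiles to a
     * distinct conditional branch instruction. With ~2k branch sites in the hot
     * loop, the working set of branches exceeds a 1.5k-entry L1 BTB, so target
     * lookups spill to the (slower) L2 BTB. All constants are arbitrary. */
    #include <stdint.h>
    #include <stdio.h>

    #define B4(x)   if (v & (1u << (((x) + 0) % 31))) v ^= 0x9E3779B9u; \
                    if (v & (1u << (((x) + 1) % 31))) v += 0x85EBCA6Bu; \
                    if (v & (1u << (((x) + 2) % 31))) v *= 0xC2B2AE35u; \
                    if (v & (1u << (((x) + 3) % 31))) v -= 0x27D4EB2Fu;
    #define B16(x)  B4(x)   B4((x) + 4)    B4((x) + 8)    B4((x) + 12)
    #define B64(x)  B16(x)  B16((x) + 16)  B16((x) + 32)  B16((x) + 48)
    #define B256(x) B64(x)  B64((x) + 64)  B64((x) + 128) B64((x) + 192)

    static uint32_t churn(uint32_t v, long iters)
    {
        for (long i = 0; i < iters; i++) {
            /* ~2,048 separate branch instructions per loop iteration */
            B256(0)    B256(256)  B256(512)  B256(768)
            B256(1024) B256(1280) B256(1536) B256(1792)
        }
        return v;
    }

    int main(void)
    {
        printf("%u\n", churn(12345u, 100000));
        return 0;
    }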

Meanwhile the front-end’s op cache has been more significantly improved. The op cache is not only 68% larger than before (now storing 6.75k ops), but it can now spit out up to 9 macro-ops per cycle, up from 6 on Zen 3. So in scenarios where the branch predictor is doing especially well at its job and the micro-op queue can consume additional instructions, it’s possible to get up to 50% more ops out of the op cache. Besides the performance improvement, this also benefits power efficiency, since tapping cached ops requires a lot less power than decoding new ones.

With that said, the output of the micro-op queue itself has not changed. The final stage of the front-end can still only spit out 6 micro-ops per clock, so the improved op cache transfer rate is chiefly useful in scenarios where the micro-op queue would otherwise be running low on ops to dispatch.
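
To illustrate where the op cache pays off, consider the two versions of the same reduction below. This is a sketch built on assumptions (the unroll pragma and the factor of 2048 are placeholders, and actual compiler behavior will vary), not measured Zen 4 data: a compact loop body is only a handful of macro-ops and can be replayed from the op cache indefinitely, while a sufficiently bloated body can exceed the 6.75k-op capacity and force the front-end back onto the legacy x86 decoders.

    /* Sketch: op cache residency. The compact loop easily fits in the op cache;
     * the aggressively unrolled variant is an assumed way to blow past a few
     * thousand macro-ops so the front end must keep re-decoding x86 instructions. */
    #include <stddef.h>

    long sum_compact(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)      /* tiny body: lives in the op cache */
            s += a[i];
        return s;
    }

    long sum_unrolled(const long *a, size_t n)
    {
        long s = 0;
    #pragma GCC unroll 2048                 /* factor chosen only to inflate the code footprint */
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }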

Switching to the back-end of the Zen 4 execution pipeline, things are once again relatively unchanged. There are no pipeline or port changes to speak of; Zen 4 can still (only) schedule up to 10 Integer and 6 Floating Point operations per clock. Similarly, the fundamental floating point op latencies remain unchanged at 3 cycles for FADD and FMUL, and 4 cycles for FMA.
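
Those latencies are why hot floating point loops benefit from instruction-level parallelism. The sketch below simply assumes the figures above (4-cycle FMA latency and two FMA-capable pipes) and contrasts a single-accumulator dot product, whose dependency chain limits it to one FMA every 4 cycles, with an eight-accumulator version that gives the out-of-order machinery enough independent chains to keep both pipes busy.

    /* Sketch: hiding FMA latency with independent accumulators.
     * One accumulator    -> serialized: ~1 FMA per 4 cycles.
     * Eight accumulators -> enough independent work for 2 pipes x 4-cycle latency. */
    #include <math.h>
    #include <stddef.h>

    float dot_one_acc(const float *a, const float *b, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s = fmaf(a[i], b[i], s);        /* each FMA waits on the previous one */
        return s;
    }

    float dot_eight_acc(const float *a, const float *b, size_t n)
    {
        float acc[8] = {0};
        size_t i = 0;
        for (; i + 8 <= n; i += 8)
            for (int k = 0; k < 8; k++)     /* eight independent dependency chains */
                acc[k] = fmaf(a[i + k], b[i + k], acc[k]);
        for (; i < n; i++)                  /* scalar tail */
            acc[0] = fmaf(a[i], b[i], acc[0]);
        return acc[0] + acc[1] + acc[2] + acc[3] + acc[4] + acc[5] + acc[6] + acc[7];
    }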

Instead, AMD’s improvements to the back-end of Zen 4 have likewise focused on larger caches and buffers. Of note, the retire queue/reorder buffer is 25% larger, now 320 instructions deep, giving the CPU a wider window of instructions to look through to extract performance via out-of-order execution. Similarly, the Integer and FP register files have been increased in size by about 20% each, to 224 registers and 192 registers respectively, in order to accommodate the larger number of instructions that are now in flight.

The only other notable change here is AVX-512 support, which we touched upon earlier. AVX execution takes place in AMD’s floating point ports, and as such, those have been beefed up to support the new instructions.
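
For software, the practical upshot is that ordinary AVX-512 code now runs on a desktop Ryzen. The snippet below is a minimal sketch rather than AMD sample code; it assumes an AVX-512-capable CPU and a compiler building with an AVX-512 target (e.g. -mavx512f). On Zen 4, each 512-bit operation is internally executed across the 256-bit-wide FP datapath, but that is invisible at the instruction level.

    /* Minimal AVX-512 dot product sketch. Processes 16 floats per 512-bit vector. */
    #include <immintrin.h>
    #include <stddef.h>

    float dot_avx512(const float *a, const float *b, size_t n)
    {
        __m512 acc = _mm512_setzero_ps();
        size_t i = 0;
        for (; i + 16 <= n; i += 16)
            acc = _mm512_fmadd_ps(_mm512_loadu_ps(a + i),
                                  _mm512_loadu_ps(b + i), acc);

        float lanes[16];
        _mm512_storeu_ps(lanes, acc);       /* horizontal reduction of the vector */
        float s = 0.0f;
        for (int k = 0; k < 16; k++)
            s += lanes[k];

        for (; i < n; i++)                  /* scalar tail */
            s += a[i] * b[i];
        return s;
    }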

Moving on, the load/store units within each CPU core have also received larger buffers. The load queue is 22% deeper, now storing 88 loads. And according to AMD, they’ve made some unspecified changes to reduce port conflicts with their L1 data cache. Otherwise, load/store throughput remains unchanged at 3 loads and 2 stores per cycle.

Finally, let’s talk about AMD’s L2 cache. As previously disclosed by the company, the Zen 4 architecture is doubling the size of the L2 cache on each CPU core, taking it from 512KB to a full 1MB. As with AMD’s lower-level buffer improvements, the larger L2 cache is designed to further improve performance/IPC by keeping more relevant data closer to the CPU cores, as opposed to ending up in the L3 cache, or worse, main memory. Beyond that, the L3 cache remains unchanged at 32MB for an 8 core CCX, functioning as a victim cache for each CPU core’s L2 cache.
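
As a back-of-the-envelope illustration of what the doubling buys, the short program below works out how large a square double-precision tile could stay resident in L2 for a blocked matrix multiply. The tiling math is an assumption of ours for illustration, not an AMD sizing guideline.

    /* Worked example: working-set headroom from the larger L2.
     * Keeping one tile each of A, B and C resident costs 3 * T * T * 8 bytes,
     * so the largest square tile satisfies T <= sqrt(L2 / 24). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double l2_zen3 = 512.0 * 1024.0;      /* 512 KB per core (Zen 3) */
        const double l2_zen4 = 1024.0 * 1024.0;     /* 1 MB per core (Zen 4)   */

        printf("Zen 3: max tile ~%dx%d doubles\n",
               (int)sqrt(l2_zen3 / 24.0), (int)sqrt(l2_zen3 / 24.0));
        printf("Zen 4: max tile ~%dx%d doubles\n",
               (int)sqrt(l2_zen4 / 24.0), (int)sqrt(l2_zen4 / 24.0));
        return 0;
    }

The linear tile dimension grows by roughly 40%, which is another way of saying a core can now keep twice the working set in its private cache before traffic spills to the shared L3.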

All told, we aren’t seeing very many major changes in the Zen 4 execution pipeline, and that’s okay. Increasing cache and buffer sizes is another tried and true way to improve the performance of an architecture by keeping an existing design filled and working more often, and that’s what AMD has opted to do for Zen 4. Especially coming in conjunction with the jump from TSMC 7nm to 5nm and the resulting increase in transistor budget, this is a good way to put those additional transistors to good use while AMD works on a more significant overhaul to the Zen architecture for Zen 5.

205 Comments

  • emn13 - Monday, September 26, 2022 - link

    The geekbench 4 ST results for the 7600x seem very low - is that benchmark result borked, or is there really something weird going on?
  • emn13 - Monday, September 26, 2022 - link

    Sorry, I meant the geekbench 4 MT not ST results. The score trails way behind even the 3600xt.
  • Silver5urfer - Monday, September 26, 2022 - link

    Good write up.

    First, I would humbly request that you include older Intel processors in your suite; it would make it easier to understand the relative gains versus, e.g., the old 9th and 10th gen. I see numbers all over the place on other sites, while AT is at least consistent, so it would be better to have a ton of CPUs in one spot. Thanks

    Now speaking about this launch.

    The IOD is now improved by a huge factor, so no more of the IF clock messing with the I/O controller and the high voltages we saw on Zen 3; it's all improved, so I think the USB dropout issues are fixed on this platform now. Plus, DP2.0 on the iGPU is a dead giveaway that RDNA3 will have DP2.0 as well.

    The IMC is also improved. AMD used to operate with clocks synchronized to the DRAM; now they can run without that, since the IF sits at 2000MHz while the IMC and DRAM run higher at 3000MHz to match the DDR5 data rates. Plus, EXPO also brings lower latency. However, the MCM design causes the AIDA benchmark to show high latency vs Intel, even though Intel is operating at a Gear 2 ratio with the uncore similarly decoupled. Surprisingly, the inter-core latencies did not change much; maybe that's one of the keys to improving further on AMD's side. We'll have to see what they do for Zen 5.

    The CPU clocks are insane; 5GHz on all 16C/32T is a huge thing, plus even the 7600X is hitting 5.4GHz. Massive boost from AMD improving their design, plus the TSMC 5N high-performance node is too good. However, AMD did let temperatures and power climb. It's a very good move not to castrate the CPU with power limits and clocks; now that those are lifted, it gets to spread its wings. But the downside is that, unlike Intel's i7 series, the Ryzen 6-core also gets hot, meaning budget buyers need to invest in an AIO, whereas older Zen 3 was fine on air. That's a negative effect for AMD of removing the power limits like Intel and letting these rip to 250W.

    The chipset downlink capping out at PCIe 4.0 x4 is the biggest negative I can think of, because Intel's DMI is now 4.0 x8 on ADL and RPL; RKL had it at 3.0 x8 and CML at 3.0 x4. AMD is stuck at 4.0 x4 carried over from X570. Many will not even care, but it is a disadvantage when you pay top money for X670E; they should have given us PCIe 5.0 x4. AMD will give us that in 2024 with the Zen 5 X770 chipset, that's my guess.

    The ILM backplate engineering is solid; that alone, plus the longevity of the LGA1718 AM5 socket, is a major PLUS for AMD over LGA1700's bending ILM and its EOL with 13th gen. Yes, 12th gen is a better purchase given that the cooling requirements for the i7 and i5 are not as high as for the R6 and R7, boards are cheaper, 13th gen is coming, and AMD's platform is new so you would be a guinea pig. It depends on what people want, how much they can spend, and what they want in longevity.

    Performance is top notch for the 7600X and 7950X, absolute sheer dominance, but the pricing is high when you look at the % gains vs Zen 3 and Intel 12th gen parts, with a mandatory AIO added on top because they run hot. The gaming performance is as expected, not much to see here, and the 5800X3D is still a contender there, but to me that chip is worthless as it cannot match other processors in high-core-count workloads. The 7600X is a champion 6C/12T though; it beats 12C/24T parts in many things, and the 10C/20T 10th gen Intel too. IPC gains are massive in ST and MT workloads, as expected. AMD Zen 4 will decimate ARM; Apple has only one thing going for it, lol, "muh efficiency," with all that BGA baggage and a locked-down ecosystem thrown in for free.

    The RPCS3 performance in TPU's Red Dead Redemption test is weird, as I do not see any gains over Intel, given how much of a beast AVX-512 is on Zen 4 with its 2x256-bit implementation and no AVX offset; maybe they are not using AVX-512. Plus, their AMD Zen 3 numbers also look off, because those chips do not do well vs even Intel 9th gen there. I wish you guys covered Dolphin, PCSX2, RPCS3, and the Switch emulators.

    I think the best option is to wait for next year and buy these parts when prices drop; right now there is no high-capacity PCIe 5.0 SSD, and no PCIe 5.0 GPU either, since even Nvidia skimped on it. There's no use for the new platform unless one is running a super old CPU and GPU setup.

    It's a shame that OC is totally dead. Zen 3 was ham-fisted, with Curve Optimizer and memory tuning becoming a headache due to how AGESA was handled, the high 1.4V voltages, and the lack of documentation. Zen 4 runs at just 1.0-1.2V and there's still no OC headroom, because AMD's design is now pushed to its maximum, right up against its core TjMax, and it works on the basis of core temperatures over everything else. There's no room here; an AIO is already saturated at 90C. The heat density is too high on AMD's side, similar to Intel 11th and 12th gen, although Intel can go up to 350W and push all cores higher vs AMD's 250W max. Well, OC was on life support anyway; only Intel is basically keeping it alive at this point, and after 10th gen it got worse, with 12th gen running very hot. Now with 13th gen we'll have to see whether that DLVR regulator helps or not.

    All in all, a good CPU, but it has some downsides. Not much of an upgrade for folks on existing 2020-class hardware at all. Better to wait until DDR5 matures further and PCIe 5.0 becomes more prevalent.
  • Threska - Monday, September 26, 2022 - link

    Maybe people will start delidding.

    https://youtu.be/y_jaS_FZcjI
  • Silver5urfer - Tuesday, September 27, 2022 - link

    That delid is a direct-die setup; it will 100% ruin the AM5 socket's longevity and the whole CPU too. That guy runs HWBot, of course he will make a video for his BS delid kits. Nobody should run any CPU with the IHS completely blown off. You will have a ton of issues with that: water leaks, the CPU silicon die cracking from thermal cycling and mounting-pressure differences over time, liquid metal leaks. It's a total bust of the warranty on any parts, and once that LM drops onto your machine, it's game over for a $5000 rig.

    AMD should have made some more improvements and reduced the max TjMax to, say, 90C at least, but it is what it is, unfortunately (for the high temps and cooling requirements) and fortunately (for the super high performance).
  • Threska - Tuesday, September 27, 2022 - link

    There are some in the comments wondering both whether lapping would achieve the same thing and whether the thicker lid leaves some room for future additions like 3D cache, etc.
  • abufrejoval - Wednesday, September 28, 2022 - link

    I'm not sure that the PCIe 4.0 "DMI" downlink cap is a hard cap per se by the SoC, rather than the result of link negotiation with the ASMedia chipset, which can't do better. I'd assume that once someone comes up with a PCIe 5.0 chipset/switch, there is no reason it won't do PCIe 5.0. It's just a bundle of 4 lanes that happen to be connected to ASMedia PCIe 4.0 chips on all current mainboards.

    Likewise, I don't see why you couldn't add a second chipset/switch to the "NVMe" port of the SoC or any of the bifurcated slots: what you see are motherboard design choices, not Ryzen 7000 limitations. The SoC just has 24 PCIe 5.0 lanes to offer in many bundle variants; it's the mainboard that straps all that flexibility to specific slots and ports.

    I don't see that you have to invest in AIO coolers, *unless* you want/need top clocks on all cores. If your workloads are mixed, e.g. a few threads that profit from top clocks for interactive workloads (including games) and others that are more batch-oriented like large compiles or renders, you may get maximum personal value even from an air cooler that only handles 150 Watts.

    The interactive stuff will still rev to 5.crazy clocks on, say, 4-8 cores, while for the batch stuff you may not be waiting in front of the screen anyway (or you can do other things while it's chugging in the background). So if it spends 2 extra hours on a job that might take 8 hours on an AIO, that may be acceptable if it saves you from putting fluids into your computer.

    In a way AMD is now giving you a clear choice: the performance you can obtain from the high-end variants is mostly limited by the amount of cooling you want to provide. And as a side effect it also steers power consumption: if you provide 150 Watts' worth of cooling, the chip won't consume more except for short bursts.

    In that regard it's much like a 5800U laptop, which you can configure between, say, 15/28/35 Watts of TDP for distinct working points in terms of power vs. cooling/noise (and battery endurance).

    Hopefully AMD will provide integration tools on both Windows and Linux to check/measure/adjust the various power settings at run-time, so you can adjust your machine to your own noise/heat/performance bias, depending on the job it's running.
  • Dug - Monday, September 26, 2022 - link

    "While these comments make sense, ultimately very few users apply memory profiles (either XMP or other) as they require interaction with the BIOS"

    This is getting so old. Your assumption is incorrect, which should be obvious from the millions of articles and YouTube videos on building computers. Not to mention, your entire article is not even directed at the "general public" but at enthusiasts. Otherwise, why write out this entire article? Just say you put a CPU in a motherboard and it works. Say it's fast. Article done.

    Why not test with Curve Optimizer?
  • Oxford Guy - Tuesday, September 27, 2022 - link

    This text appears again and again for the same reason Galileo was placed under house arrest.
  • socket420 - Monday, September 26, 2022 - link

    Could someone, preferably Ryan or Gavin, please elaborate on what this sentence - "the new chip is compliant with Microsoft’s Pluton initiative as well" - actually means? This is the only review I could find that mentions Pluton in conjunction with desktop Zen 4 at all, but merely saying it's "compliant" is a weird way of wording it. Is Pluton on-die and enabled by default in Ryzen 7000 desktop CPUs?
