Zen 4 Execution Pipeline: Familiar Pipes With More Caching

Finally, let’s take an in-depth look at the Zen 4 microarchitecture’s execution flow. As we noted before, AMD is seeing a 13% IPC improvement over Zen 3. So how did they do it?

There is no single radical change anywhere in the Zen 4 architecture. Zen 4 does make a few notable changes, but the basics of the instruction flow are unchanged, especially in the back-end execution pipelines. Rather, many (if not most) of the IPC improvements in Zen 4 come from enlarging caches and buffers in one respect or another.

Starting with the front end, AMD has made a few important improvements here. The branch predictor, a common target for improvements given the payoffs of correct predictions, has been further iterated upon for Zen 4. While still predicting 2 branches per cycle (the same as Zen 3), AMD has increased the L1 Branch Target Buffer (BTB) cache size by 50%, to 2 x 1.5k entries. And similarly, the L2 BTB has been increased to 2 x 7k entries (though this is just an ~8% capacity increase). The net result is that the branch predictor’s accuracy improves, as it can look over a longer history of branch targets.
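To see why extra BTB entries help, consider a toy model. The sketch below simulates a direct-mapped target buffer (a deliberate simplification; AMD's actual BTB organization and replacement policy are not disclosed here) and runs a synthetic branch stream with more distinct branches than the smaller buffer can hold. The 1024/1536 entry counts echo the Zen 3 and Zen 4 L1 BTB sizes, but the model itself is purely illustrative:

```python
def btb_hit_rate(num_entries, branches):
    """Run a stream of branch IDs through a direct-mapped target buffer
    and return the fraction of lookups that hit."""
    btb = {}  # index -> tag of the branch currently cached there
    hits = 0
    for b in branches:
        index = b % num_entries
        tag = b // num_entries
        if btb.get(index) == tag:
            hits += 1
        btb[index] = tag  # install/refresh on every lookup
    return hits / len(branches)

# A loop that cycles through 1200 distinct branches: too many for the
# smaller buffer, comfortably within the larger one.
stream = [i % 1200 for i in range(100_000)]
small = btb_hit_rate(1024, stream)  # Zen 3-like L1 BTB capacity
large = btb_hit_rate(1536, stream)  # Zen 4-like: 50% more entries
print(f"1024 entries: {small:.2%} hit rate; 1536 entries: {large:.2%}")
```

With the larger buffer, the only misses left are the compulsory first-touch ones; the smaller buffer keeps thrashing on the branches that alias each other.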

Meanwhile the front-end’s op cache has been more significantly improved. The op cache is not only 68% larger than before (now storing 6.75k ops), but it can now spit out up to 9 macro-ops per cycle, up from 6 on Zen 3. So in scenarios where the branch predictor is doing especially well at its job and the micro-op queue can consume additional instructions, it’s possible to get up to 50% more ops out of the op cache. Beyond the performance improvement, this also benefits power efficiency, since serving cached ops takes far less power than decoding new ones.

With that said, the output of the micro-op queue itself has not changed. The final stage of the front-end can still only issue 6 micro-ops per clock, so the op cache’s improved transfer rate mainly helps in scenarios where the micro-op queue would otherwise be running low on ops to dispatch.
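A quick back-of-the-envelope check on the op cache figures above. The Zen 3 baselines used here (a ~4K-op cache delivering 6 macro-ops per cycle) are widely reported numbers rather than figures from this article:

```python
# Op cache capacity: 6.75k ops vs. Zen 3's 4K ops (assumed baseline).
zen3_capacity, zen4_capacity = 4096, 6912  # 6.75k = 6.75 * 1024
capacity_growth = zen4_capacity / zen3_capacity - 1
print(f"op cache capacity growth: {capacity_growth:.1%}")  # ~68.8%

# Delivery bandwidth: 9 macro-ops/cycle vs. 6 on Zen 3.
zen3_rate, zen4_rate = 6, 9
print(f"delivery rate growth: {zen4_rate / zen3_rate - 1:.0%}")  # 50%

# Dispatch is still capped at 6 micro-ops per clock, so the sustained
# rate through the whole front-end is unchanged; the faster op cache
# only helps refill a micro-op queue that would otherwise run dry.
dispatch_width = 6
sustained = min(zen4_rate, dispatch_width)
print(f"sustained front-end rate: {sustained} ops/clock")
```

The last line is the point of the paragraph above: the 9-wide op cache is a burst-refill mechanism, not a wider steady-state pipe.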

Switching to the back-end of the Zen 4 execution pipeline, things are once again relatively unchanged. There are no pipeline or port changes to speak of; Zen 4 can still schedule up to 10 integer and 6 floating point operations per clock. Similarly, the fundamental floating point op latencies remain unchanged: 3 cycles for FADD and FMUL, and 4 cycles for FMA.
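To illustrate why that 4-cycle FMA latency matters, here is a simplified latency-versus-throughput model. The port count and the model itself are assumptions for illustration, not a cycle-accurate description of Zen 4:

```python
FMA_LATENCY = 4  # cycles, as quoted above
FMA_PORTS = 2    # assumed number of FMA-capable FP pipes

def fma_cycles(num_ops, chains):
    """Approximate cycles to finish num_ops FMAs spread across `chains`
    independent dependency chains, ignoring front-end limits."""
    latency_bound = (num_ops / chains) * FMA_LATENCY  # serial dependencies
    throughput_bound = num_ops / FMA_PORTS            # port pressure
    return max(latency_bound, throughput_bound)

print(fma_cycles(1000, 1))  # one long dependent chain: latency-bound
print(fma_cycles(1000, 8))  # eight independent chains: throughput-bound
```

A single dependent chain pays the full 4 cycles per op, while enough independent chains hide the latency entirely and run into the port limit instead, which is why unchanged latencies are tolerable when the out-of-order window grows.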

Instead, AMD’s improvements to the back-end of Zen 4 have likewise focused on larger caches and buffers. Of note, the retire queue/reorder buffer is 25% larger, now 320 instructions deep, giving the CPU a wider window of instructions to look through to extract performance via out-of-order execution. Similarly, the integer and FP register files have each grown by roughly 20%, to 224 and 192 registers respectively, in order to accommodate the larger number of instructions that are now in flight.
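Sanity-checking those growth figures. The Zen 3 baselines here (256-entry reorder buffer, 192 integer and 160 FP registers) are widely reported numbers rather than figures from this article; note the integer register file works out closer to +17% than the rounded "about 20%":

```python
# Zen 3 baselines are assumed from public disclosures, not this article.
zen3 = {"reorder buffer": 256, "int registers": 192, "fp registers": 160}
zen4 = {"reorder buffer": 320, "int registers": 224, "fp registers": 192}

for name in zen3:
    growth = zen4[name] / zen3[name] - 1
    print(f"{name}: {zen3[name]} -> {zen4[name]} (+{growth:.0%})")
```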

The only other notable change here is AVX-512 support, which we touched upon earlier. AVX execution takes place in AMD’s floating point ports, and as such, those have been beefed up to support the new instructions.

Moving on, the load/store units within each CPU core have also been given a buffer enlargement. The load queue is 22% deeper, now storing 88 loads. And according to AMD, they’ve made some unspecified changes to reduce port conflicts with their L1 data cache. Otherwise the load/store throughput remains unchanged at 3 loads and 2 stores per cycle.

Finally, let’s talk about AMD’s L2 cache. As previously disclosed by the company, the Zen 4 architecture is doubling the size of the L2 cache on each CPU core, taking it from 512KB to a full 1MB. As with AMD’s lower-level buffer improvements, the larger L2 cache is designed to further improve performance/IPC by keeping more relevant data closer to the CPU cores, as opposed to ending up in the L3 cache, or worse, main memory. Beyond that, the L3 cache remains unchanged at 32MB for an 8 core CCX, functioning as a victim cache for each CPU core’s L2 cache.
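Because the L3 works as a victim cache (largely exclusive of the L2s), per-CCX capacity is roughly additive, so the doubled L2 raises total capacity beyond L1 as well. The sketch below is simple capacity arithmetic on the quoted sizes, an approximation rather than an occupancy model:

```python
KB = 1024
l2_per_core = 1024 * KB      # 1MB per core, doubled from Zen 3's 512KB
l3_per_ccx = 32 * 1024 * KB  # 32MB shared across an 8-core CCX
cores = 8

total_mb = (cores * l2_per_core + l3_per_ccx) / (1024 * KB)
print(f"approx L2+L3 capacity per CCX: {total_mb:.0f} MB")  # 8 + 32
```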

All told, we aren’t seeing very many major changes in the Zen 4 execution pipeline, and that’s okay. Increasing cache and buffer sizes is another tried-and-true way to improve the performance of an architecture by keeping an existing design filled and working more often, and that’s what AMD has opted to do for Zen 4. Especially coming in conjunction with the jump from TSMC 7nm to 5nm and the resulting increase in transistor budget, this is a good way to put those additional transistors to good use while AMD works on a more significant overhaul to the Zen architecture for Zen 5.


205 Comments

  • RestChem - Wednesday, October 5, 2022 - link

    Meh, time will out the ultimate price-points and all that, but as it emerges I really wonder what kind of users are looking to drop this kind of dollarses on high-end AMD builds. My gut is that they've priced themselves out of their primary demographic, and max TDP is right up there too, same as with their GPUs. When it comes down to a difference of a couple hundred bucks per build (assuming people build these with the pricey DDR5-6000 there's scant mobo support for through whatever AMD's integrated mem-OC profile scheme is) are there going to be enough users who just root hard enough for the underdog to build on these platforms, contra even high-end Alder Lake or (however much extra, remains at time of writing to be seen) Raptor Lake builds? Before the announcements I was expecting AMD to get in cheap again, promise at least like performance for a bit of a discount, but it seems even those days are over and they want to play head-to-head. I wish them the best but I don't see them scoring well in that fight.
  • tvdang7 - Thursday, October 6, 2022 - link

    " I have a 1440p 144Hz monitor and I play at 1080p just because that's what I'm used to."
    Is this some kind of joke? We are supposed to listen to reviewers that are stuck in 2010?
  • Hresna - Sunday, October 9, 2022 - link

    I’m curious as to whether there’s any appreciable difference to a consumer as to whether a particular PCIe lane or USB port is provisioned by the CPU or the Chipset…. Like, is there a reliability, performance, or some other metric difference?

    I’m just curious why it’s a design consideration to even include them in the CPU design to begin with, unless it has to do with how the CPU lanes are multiplexed in/out of the CPU and somehow some of the lanes can talk inter-device via the chipset without involving the cpu…
  • bigtree - Monday, October 10, 2022 - link

    Where is octa-channel memory? Dual channel memory in a $300 CPU?
    Where is native Thunderbolt 4 support?
    (Mac minis have had Thunderbolt 3 for over 5 years.)
    Can't even find one X670 motherboard with 4x Thunderbolt 4 ports. And you want $300? Thunderbolt 4 should be standard on the cheapest boards. It's a $20 chip.
  • Oxford Guy - Monday, October 10, 2022 - link

    The mission of corporations is to extract profit for shareholders and protect the lavish lifestyles of the rich. It is not to provide value to the plebs. Do the absolute minimum is the mantra.
  • RedGreenBlue - Tuesday, October 11, 2022 - link

    That must be why Intel made Thunderbolt royalty-free and it’s now built into USB 4.
  • Oxford Guy - Wednesday, October 12, 2022 - link

    It probably can afford to since states like Ohio are willing to bankroll half of the cost of its fabs.
  • RedGreenBlue - Tuesday, October 11, 2022 - link

    It’s built into USB 4 now. Just make sure it’s functional already because it might need a driver; AMD did that on the 600 series. Aside from that important fact, I don’t care if there aren't many boards with it. The Thunderbolt ecosystem has been crap since the beginning. Peripheral makers didn’t take advantage of it because USB was a more common approach and Intel didn’t make Thunderbolt cheap to implement. The Mac Minis have it because Apple made a big bet on it when it came out. These days it’s nice to have, but it’s a throw-away feature unless you have a niche product that needs it. It’s for niche purposes and that would have been a waste of PCIe lanes. I would’ve liked it for external GPUs, but Intel effectively shut that down and I don’t know if they’ve opened the door to it again. USB is way more convenient.
  • RedGreenBlue - Tuesday, October 11, 2022 - link

    And 8-channel memory? This sounds like a joke. That’s for server or workstation CPUs because of how many layers it takes for the wiring on the board and the pins on the socket. That’s part of why server and workstation boards are so expensive. If you need that much bandwidth you’re in the wrong market segment. Look at Threadripper chips.
  • RedGreenBlue - Tuesday, October 11, 2022 - link

    It would be appreciated if architecture reviews had the pipeline differences in a chart to compare across generations. AnandTech used to include that, and it gave a good comparison of different generations and competitor architectures. I can understand not including it in the product review, but I don’t remember a chart being in the previous Zen 4 overview article.
