Section by Andrei Frumusanu

The New Zen 3 Core: Front-End Updates

Moving on, let’s see what makes the Zen3 microarchitecture tick and detail how it actually improves things compared to its predecessor design, starting off with the front-end of the core, which includes branch prediction, decode, the OP-cache path and instruction cache, and the dispatch stage.

From a high-level overview, Zen3’s front-end looks the same as Zen2’s, at least from a block-diagram perspective. The fundamental building blocks are the same, starting off with the branch-predictor unit, which AMD calls state-of-the-art. This feeds into a 32KB instruction cache which forwards instructions into a 4-wide decode block. There is still a two-way flow into the OP-queue: instructions that have previously been decoded are stored in the OP-cache, from which they can be retrieved with greater bandwidth (8 Mops/cycle) and lower power consumption the next time they are encountered.

Zen3’s improvements in these blocks include a faster branch predictor which is able to predict more branches per cycle. AMD wouldn’t detail exactly what this means, but we suspect it alludes to two branch predictions per cycle instead of just one. This is still a TAGE-based design, as introduced with Zen2, and AMD does say that it has been able to improve the accuracy of the predictor.

Amongst the branch unit structure changes, we’ve seen a rebalancing of the BTBs, with the L1 BTB doubling in size from 512 to 1024 entries. The L2 BTB has seen a slight reduction from 7K to 6.5K entries, but this has allowed the structure to become more efficient. The indirect target array (ITA) has also seen a more substantial increase, from 1024 to 1536 entries.

If there is a misprediction, the new design reduces the cycle latency required to get a new instruction stream going. AMD wouldn’t disclose the absolute misprediction penalty in cycles or how much faster it is this generation, but a reduced misprediction penalty would be a significant performance boost to the overall design.
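To put a rough number on that penalty yourself, the usual approach is a small microbenchmark rather than vendor disclosure. Below is a minimal sketch in C, assuming a Linux-style clock_gettime timer: it times the same branch once with fully predictable inputs and once with random inputs, and the per-iteration delta approximates the misprediction cost at a roughly 50% miss rate. This is our own illustrative probe, not AMD’s methodology, and compilers may if-convert the branch into a conditional move, so the generated assembly should be checked.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1u << 24)

static double run_ns(const uint8_t *data)
{
    struct timespec t0, t1;
    volatile uint64_t sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++) {
        if (data[i] & 1)      /* the branch under test */
            sum += i;
        else
            sum -= i;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    uint8_t *always = malloc(N), *random_data = malloc(N);
    for (size_t i = 0; i < N; i++) {
        always[i] = 1;               /* always taken: near-perfect prediction      */
        random_data[i] = rand() & 1; /* ~50% mispredicted, beyond any history depth */
    }
    /* Caveat: check the assembly - compilers may turn the if/else into a
     * branchless cmov, in which case both runs will time the same. */
    double tp = run_ns(always), tr = run_ns(random_data);
    printf("predictable: %.2f ns/branch, random: %.2f ns/branch, delta: %.2f ns\n",
           tp / N, tr / N, (tr - tp) / N);
    free(always);
    free(random_data);
    return 0;
}
```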

AMD claims no bubbles on most predictions due to the increased branch predictor bandwidth. Here I can see parallels to what Arm introduced with the Cortex-A77, where a similarly doubled-up branch predictor bandwidth is able to run ahead of subsequent pipeline stages and thus fill bubble gaps before they hit the execution stages and potentially stall the core.

On the instruction cache side, we didn’t see a change in the size of the structure, as it’s still a 32KB 8-way block; however, AMD has improved its utilisation. Prefetchers are now said to be more efficient and aggressive in pulling data out of the L2 ahead of it being needed in the L1. We don’t know exactly what kind of pattern AMD alludes to having improved here, but if the L1I behaves the same as the L1D, then adjacent cache lines would be pulled into the L1I as well. The details of the better utilisation weren’t clear and AMD wasn’t willing to divulge more, but we suspect a new cache-line replacement policy to be a key aspect of this improvement.

Being an x86 core, one of the difficulties of the ISA is the fact that instructions are of variable length, with encodings ranging from 1 byte to 15 bytes. This is a legacy side-effect of the continuous extensions to the instruction set over the decades, and as modern CPU microarchitectures become wider in their execution throughput, it has become an issue for architects to design efficient wide decoders. For Zen3, AMD opted to remain with a 4-wide design, as going wider would have meant additional pipeline cycles which would have reduced the performance of the whole design.
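The serial nature of the problem is easy to see in software: the start of instruction N+1 is only known once the length of instruction N has been determined. The toy C walker below, using a deliberately tiny and hypothetical length table that covers only a handful of one-byte-opcode cases, illustrates that dependency; a real x86 decoder additionally has to parse prefixes, opcode maps, ModRM/SIB bytes and displacement/immediate fields for every start position it speculates on, which is part of what makes wide parallel decode expensive.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Toy length rules covering only a handful of one-byte-opcode cases;
 * this is NOT a real x86 decoder. */
static size_t insn_length(const uint8_t *p)
{
    switch (p[0]) {
    case 0x90: return 1;  /* nop                 */
    case 0x55: return 1;  /* push rbp            */
    case 0xC3: return 1;  /* ret                 */
    case 0xB8: return 5;  /* mov eax, imm32      */
    case 0xE9: return 5;  /* jmp rel32           */
    default:   return 1;  /* unknown: pretend 1B */
    }
}

static size_t count_instructions(const uint8_t *code, size_t size)
{
    size_t off = 0, count = 0;
    while (off < size) {
        size_t len = insn_length(code + off); /* serial dependence: the length...      */
        off += len;                           /* ...must be known before the next start */
        count++;
    }
    return count;
}

int main(void)
{
    /* push rbp; nop; mov eax, 42; jmp +0; ret */
    const uint8_t code[] = { 0x55, 0x90,
                             0xB8, 0x2A, 0x00, 0x00, 0x00,
                             0xE9, 0x00, 0x00, 0x00, 0x00,
                             0xC3 };
    printf("%zu instructions in %zu bytes\n",
           count_instructions(code, sizeof code), sizeof code);
    return 0;
}
```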

Bypassing the decode stage through a structure such as the OP-cache is nowadays the preferred method of solving this issue, with the first-generation Zen microarchitecture being the first AMD design to implement such a block. However, such a design also brings problems, such as one set of instructions residing in the instruction cache while its branch target resides in the OP-cache, whose target in turn might be found back in the instruction cache. AMD found this to be quite a large inefficiency in Zen2, and has evolved the design to better handle instruction flows from both the I-cache and the OP-cache and to deliver them into the µOP-queue. AMD’s researchers seem to have published a more in-depth paper addressing the improvements.

On the dispatch side, Zen3 remains a 6-wide machine, emitting up to 6 Macro-Ops per cycle to the execution units, meaning that the maximum IPC of the core remains at 6. The OP-cache being able to deliver 8 Macro-Ops per cycle into the µOp-queue serves as a mechanism to further reduce pipeline bubbles in the front-end, since that structure won’t be able to sustain its full 8-wide bandwidth at all times.

On the execution engine side of things, we’ve seen a larger overhaul of the design, with the Zen3 core widening both its integer and floating-point issue widths, alongside larger execution windows and lower-latency execution units.

Starting off in more detail on the integer side, the largest change in the design has been a move from individual schedulers for each of the execution units to a more consolidated design of four schedulers issuing into two execution units each. These new 24-entry schedulers should be more power efficient than having separate smaller schedulers, and the total entry capacity also grows slightly, from 92 to 96.

The integer physical register file has seen a slight increase from 180 to 192 entries, allowing for a slight increase in the integer out-of-order window, with the actual reorder buffer of the core growing from 224 to 256 instructions, which still seems relatively small in the context of competing microarchitectures such as Intel’s 352-entry ROB in Sunny Cove or Apple’s giant ROB.
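The practical meaning of a 256-entry ROB can be probed with a well-known technique from microarchitecture measurement write-ups, not something AMD provides: interleave two independent cache-missing loads with a variable amount of filler. As long as the filler fits in the reorder buffer the two misses overlap; once it exceeds the window, the second miss serialises behind the first and the time per iteration jumps. The C sketch below is an assumed, simplified version of that idea, with the filler count fixed by a macro that one would sweep around the expected window size.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ELEMS (64 * 1024 * 1024 / sizeof(void *)) /* 64 MiB, well past the L3 */
#define ITERS (1u << 20)

#define REP4(x)   x x x x
#define REP16(x)  REP4(REP4(x))
#define REP64(x)  REP4(REP16(x))
#define REP256(x) REP4(REP64(x))   /* filler count: sweep this around the ROB size */

/* Build one random cycle through the buffer so every access is a dependent miss. */
static void **make_chain(size_t elems)
{
    void **buf = malloc(elems * sizeof(void *));
    size_t *idx = malloc(elems * sizeof(size_t));
    for (size_t i = 0; i < elems; i++) idx[i] = i;
    for (size_t i = elems - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < elems; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % elems]];
    free(idx);
    return buf;
}

int main(void)
{
    void **p = make_chain(ELEMS), **q = make_chain(ELEMS);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned i = 0; i < ITERS; i++) {
        p = (void **)*p;                  /* cache miss #1                      */
        REP256(__asm__ volatile("nop");)  /* independent filler occupying ROB   */
        q = (void **)*q;                  /* cache miss #2: overlaps with #1    */
    }                                     /* only while the filler fits         */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns/iteration (p=%p q=%p)\n", ns / ITERS, (void *)p, (void *)q);
    return 0;
}
```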

The overall integer execution unit issue width has grown from 7 to 10. The breakdown here is that while the core still has 4 ALUs, one of the branch units has now split off onto its own dedicated port, whilst the other branch unit still shares a port with one of the ALUs, allowing the now-unshared ALU to dedicate itself more to actual arithmetic instructions. Not depicted here is an additional store unit, as well as a third load unit, which is what brings us to 10 issue units in total on the integer side.
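As a rough way to see what four ALUs plus a dedicated branch port mean in practice, one can issue four architecturally independent integer adds per loop iteration and check how close the loop gets to one iteration per cycle; with the loop-closing branch on its own port, it does not have to steal an ALU slot from the adds. The snippet below is a hedged sketch of such a probe, using GCC/Clang x86-64 inline assembly to stop the compiler from folding the adds away; actual results also depend on macro-op fusion and the loop overhead.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS (1ULL << 28)

int main(void)
{
    uint64_t a = 1, b = 2, c = 3, d = 4;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < ITERS; i++) {
        /* Four independent ADDs per iteration; the inline asm keeps the
         * compiler from folding them into a closed form or vectorising. */
        __asm__ volatile(
            "add %4, %0\n\t"
            "add %4, %1\n\t"
            "add %4, %2\n\t"
            "add %4, %3\n\t"
            : "+r"(a), "+r"(b), "+r"(c), "+r"(d)
            : "r"(i));
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Near the ALU-bound limit this approaches a small handful of
     * nanoseconds per thousand iterations at multi-GHz clocks. */
    printf("%.3f ns/iteration, checksum %llu\n",
           ns / ITERS, (unsigned long long)(a ^ b ^ c ^ d));
    return 0;
}
```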

On the floating-point side, the dispatch width has been increased from 4 µOps to 6 µOps. Similar to the integer pipelines, AMD has opted to disaggregate some of the pipeline capabilities, such as moving the floating-point store and floating-point-to-integer conversion units onto their own dedicated ports and units, so that the main execution pipelines are able to see higher utilisation with actual compute instructions.

One of the bigger improvements in instruction latencies has been the shaving of a cycle off fused multiply-accumulate (FMAC) operations, from 5 cycles down to 4. The scheduler on the FP side has also grown in order to handle more in-flight instructions while loads on the integer side fetch the required operands, although AMD doesn’t disclose the exact increase.
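The FMA latency change is one of the easier claims to check from software: a single dependent chain of fused multiply-adds runs at one FMA per latency-cycle count, so the time per iteration divided by the core clock during the run should land near 4 cycles on Zen3 versus 5 on Zen2. The following C sketch, assuming FMA3 support and the standard AVX intrinsics, is an illustrative measurement of that dependent chain rather than AMD’s own data.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS (1ULL << 26)

int main(void)
{
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 mul = _mm256_set1_ps(1.0000001f);
    const __m256 add = _mm256_set1_ps(1e-7f);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < ITERS; i++) {
        /* Each FMA depends on the previous result, so the loop runs at
         * one FMA per instruction-latency cycles. */
        acc = _mm256_fmadd_ps(acc, mul, add);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    float out[8];
    _mm256_storeu_ps(out, acc);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Divide by the actual core clock during the run to convert to cycles;
     * build with: gcc -O2 -mfma fma_latency.c */
    printf("%.3f ns per dependent FMA (checksum %f)\n", ns / ITERS, out[0]);
    return 0;
}
```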
