AMD Zen 3 Ryzen Deep Dive Review: 5950X, 5900X, 5800X and 5600X Tested
by Dr. Ian Cutress on November 5, 2020 9:01 AM EST
New and Improved Instructions
When it comes to instruction improvements, moving to a brand new ground-up core enables a lot more flexibility in how instructions are processed compared to just a core update. Aside from adding new security functionality, being able to rearchitect the decoder/micro-op cache, the execution units, and the number of execution units allows for a variety of new features and hopefully faster throughput.
As part of the microarchitecture deep-dive disclosures from AMD, we naturally get AMD’s messaging on the improvements in this area – we were told of the highlights, such as the improved FMAC and new AVX2/AVX256 expansions. There’s also Control-Flow Enforcement Technology (CET), which enables a shadow stack to protect against ret/ROP attacks. However, after getting our hands on the chip, we found a trove of improvements to dive through.
Let’s cover AMD’s own highlights first.
The headline item is the improved Fused Multiply-Accumulate (FMA), which is a frequently used operation in a number of high-performance compute workloads as well as machine learning, neural networks, scientific compute and enterprise workloads.
In Zen 2, a single FMA took 5 cycles with a throughput of 2/clock.
In Zen 3, a single FMA takes 4 cycles with a throughput of 2/clock.
This means that AMD’s FMA latency is now at parity with Intel, although this update is going to matter most in AMD’s EPYC processors. As we scale up this improvement to the 64 cores of the current generation EPYC Rome, any compute-limited workload on Rome should see relief in its Zen 3 successor, Milan. Combine that with the larger L3 cache and improved load/store, and some workloads should expect some good speed-ups.
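To see why a one-cycle latency cut matters even though throughput is unchanged at 2/clock, here is a minimal back-of-the-envelope sketch. It assumes a simple pipeline model (the function names and the model itself are illustrative, not AMD's or AnandTech's methodology): a chain of dependent FMAs is bound by latency, while independent FMAs are bound by issue rate.

```python
import math


def dependent_chain_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n_ops FMAs where each one consumes the previous result."""
    return n_ops * latency


def independent_cycles(n_ops: int, latency: int, per_clock: int) -> int:
    """Cycles for n_ops independent FMAs: issue-limited, plus pipeline drain."""
    issue_cycles = math.ceil(n_ops / per_clock)
    return issue_cycles + latency - 1


n = 1000
# Latency and throughput figures are the ones quoted in the article.
zen2_dep = dependent_chain_cycles(n, latency=5)
zen3_dep = dependent_chain_cycles(n, latency=4)
zen2_ind = independent_cycles(n, latency=5, per_clock=2)
zen3_ind = independent_cycles(n, latency=4, per_clock=2)

print(f"dependent chain: Zen 2 {zen2_dep} cycles, Zen 3 {zen3_dep} cycles")
print(f"independent ops: Zen 2 {zen2_ind} cycles, Zen 3 {zen3_ind} cycles")
```

Under this simplified model the dependent chain gets 20% shorter, while fully independent work barely changes, so the gain should show up mostly in code with long dependency chains such as reductions.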
The other main update is with cryptography and ciphers. In Zen 2, vector-based AES and PCLMULQDQ operations were limited to AVX / 128-bit execution, whereas in Zen 3 they are upgraded to AVX2 / 256-bit execution.
This means that VAES has a latency of 4 cycles with a throughput of 2/clock.
This means that VPCLMULQDQ has a latency of 4 cycles, with a throughput of 0.5/clock.
AMD also mentioned to a certain extent that it has increased its ability to process repeated MOV instructions on short strings – what used to not be so good for short copies is now good for both small and large copies. We detected that the new core performs better REP MOV instruction elimination at the decode stage, leveraging the micro-op cache better.
Now here’s the stuff that AMD didn’t talk about.
Integer
Sticking with instruction elimination, a lot of instructions and zeroing idioms that Zen 2 used to decode but then skip execution are now detected and eliminated at the decode stage.
- NOP (90h) up to 5x 66h
- LNOP3/4/5 (Looped NOP)
- (V)MOVAPS/MOVAPD/MOVUPS/MOVUPD vec1, vec1 : Move (Un)Aligned Packed FP32/FP64
- VANDNPS/VANDNPD vec1, vec1, vec1 : Vector bitwise logical AND NOT Packed FP32/FP64
- VXORPS/VXORPD vec1, vec1, vec1 : Vector bitwise logical XOR Packed FP32/FP64
- VPANDN/VPXOR vec1, vec1, vec1 : Vector bitwise logical (AND NOT)/XOR
- VPCMPGTB/W/D/Q vec1, vec1, vec1 : Vector compare packed integers greater than
- VPSUBB/W/D/Q vec1, vec1, vec1 : Vector subtract packed integers
- VZEROUPPER : Zero upper bits of YMM
- CLC : Clear Carry Flag
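As a rough illustration of why the zeroing idioms in the list above can be eliminated, the sketch below models a 32-bit lane in plain Python and shows that XOR, AND NOT, subtract and compare-greater-than of a register with itself always produce zero, whatever the register held. The helper names are hypothetical, and this only covers the zeroing cases, not the NOP or register-to-itself move entries.

```python
MASK = 0xFFFFFFFF  # model one 32-bit lane


def xor_self(x: int) -> int:
    """VXORPS/VPXOR vec1, vec1, vec1"""
    return (x ^ x) & MASK


def andn_self(x: int) -> int:
    """VANDNPS/VPANDN vec1, vec1, vec1: (NOT x) AND x"""
    return (~x & x) & MASK


def sub_self(x: int) -> int:
    """VPSUBB/W/D/Q vec1, vec1, vec1"""
    return (x - x) & MASK


def cmpgt_self(x: int) -> int:
    """VPCMPGTB/W/D/Q vec1, vec1, vec1: all-ones if x > x, else zero"""
    return MASK if x > x else 0


samples = [0, 1, 0x7FFFFFFF, 0xDEADBEEF, MASK]
idioms = (xor_self, andn_self, sub_self, cmpgt_self)
# Collect the set of distinct results each idiom produces across the samples.
results = {f.__name__: {f(x) for x in samples} for f in idioms}
print(results)
```

Because the result never depends on the input value, the hardware does not need to wait for the source register or occupy an execution unit; it can simply mark the destination as zero.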
As for direct performance adjustments, we detected the following:
Zen3 Updates (1) Integer Instructions
Instruction | Description | Zen 2 | Zen 3
XCHG | Exchange Register/Memory with Register | 17 cycle latency | 7 cycle latency
LOCK (ALU) | Assert LOCK# Signal | 17 cycle latency | 7 cycle latency
ALU r16/r32/r64 imm | ALU on constant | 2.4 per cycle | 4 per cycle
SHLD/SHRD | Double Precision Shift Left/Right | 4 cycle latency, 0.33 per cycle | 2 cycle latency, 0.66 per cycle
LEA [r+r*i] | Load Effective Address | 2 cycle latency, 2 per cycle | 1 cycle latency, 4 per cycle
IDIV r8 | Signed Integer Division | 16 cycle latency, 1/16 per cycle | 10 cycle latency, 1/10 per cycle
DIV r8 | Unsigned Integer Division | 17 cycle latency, 1/17 per cycle |
IDIV r16 | Signed Integer Division | 21 cycle latency, 1/21 per cycle | 12 cycle latency, 1/12 per cycle
DIV r16 | Unsigned Integer Division | 22 cycle latency, 1/22 per cycle |
IDIV r32 | Signed Integer Division | 29 cycle latency, 1/29 per cycle | 14 cycle latency, 1/14 per cycle
DIV r32 | Unsigned Integer Division | 30 cycle latency, 1/30 per cycle |
IDIV r64 | Signed Integer Division | 45 cycle latency, 1/45 per cycle | 19 cycle latency, 1/19 per cycle
DIV r64 | Unsigned Integer Division | 46 cycle latency, 1/46 per cycle | 20 cycle latency, 1/20 per cycle
Zen3 Updates (2) Integer Instructions
Instruction | Description | Zen 2 | Zen 3
LAHF | Load Status Flags into AH Register | 2 cycle latency, 0.5 per cycle | 1 cycle latency, 1 per cycle
PUSH reg | Push Register Onto Stack | 1 per cycle | 2 per cycle
POP reg | Pop Value from Stack Into Register | 2 per cycle | 3 per cycle
POPCNT | Count Bits | 3 per cycle | 4 per cycle
LZCNT | Count Leading Zero Bits | 3 per cycle | 4 per cycle
ANDN | Logical AND NOT | 3 per cycle | 4 per cycle
PREFETCH* | Prefetch | 2 per cycle | 3 per cycle
PDEP/PEXT | Parallel Bits Deposit/Extract | 300 cycle latency, 1 per 250 cycles | 3 cycle latency, 1 per clock
It’s worth highlighting those last two commands. Software that helps the prefetchers, due to how AMD has arranged the branch predictors, can now process three prefetch commands per cycle. The other element is the introduction of a hardware accelerator with parallel bits: latency is reduced 99% and throughput is up 250x. If anyone asks why we ever need extra transistors for modern CPUs, it’s for things like this.
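For readers unfamiliar with the parallel bits instructions, the following is a minimal software sketch of what PEXT and PDEP compute, written as a bit-at-a-time loop over the mask. It is a reference model for illustration only (the function names are ours) and says nothing about how either core implements them in hardware.

```python
def pext(value: int, mask: int) -> int:
    """Parallel bits extract: gather the bits of value selected by mask
    and pack them contiguously into the low bits of the result."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask          # isolate lowest set bit of the mask
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1               # clear that mask bit
    return result


def pdep(value: int, mask: int) -> int:
    """Parallel bits deposit: scatter the low bits of value into the
    bit positions selected by mask."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if value & (1 << in_bit):
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result


extracted = pext(0b10110110, 0b11110000)
deposited = pdep(extracted, 0b11110000)
print(bin(extracted), bin(deposited))
```

A loop like this does one step per set mask bit, which gives a feel for why a dedicated single-pass unit is such a large win for chess engines, bitboard code and bit-packing routines that lean on these instructions.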
There are also some regressions:
Zen3 Updates (3) Slower Instructions
Instruction | Description | Zen 2 | Zen 3
CMPXCHG8B | Compare and Exchange 8 Byte/64-bit | 9 cycle latency, 0.167 per cycle | 11 cycle latency, 0.167 per cycle
BEXTR | Bit Field Extract | 3 per cycle | 2 per cycle
BZHI | Zero High Bit with Position | 3 per cycle | 2 per cycle
RORX | Rotate Right Logical Without Flags | 3 per cycle | 2 per cycle
SHLX / SHRX | Shift Left/Right Without Flags | 3 per cycle | 2 per cycle
As always, there are trade offs.
x87
For anyone using older mathematics software, it might be riddled with a lot of x87 code. x87 was originally meant to be an extension of x86 for floating point operations, but based on other improvements to the instruction set, x87 is somewhat deprecated, and we often see regressed performance generation on generation.
But not on Zen 3. Among the regressions, we’re also seeing some improvements. Some.
Zen3 Updates (4) x87 Instructions
Instruction | Description | Zen 2 | Zen 3
FXCH | Exchange Registers | 2 per cycle | 4 per cycle
FADD | Floating Point Add | 5 cycle latency, 1 per cycle | 6.5 cycle latency, 2 per cycle
FMUL | Floating Point Multiply | 5 cycle latency, 1 per cycle | 6.5 cycle latency, 2 per cycle
FDIV32 | Floating Point Division | 10 cycle latency, 0.285 per cycle | 10.5 cycle latency, 0.800 per cycle
FDIV64 | Floating Point Division | 13 cycle latency, 0.200 per cycle | 13.5 cycle latency, 0.235 per cycle
FDIV80 | Floating Point Division | 15 cycle latency, 0.167 per cycle | 15.5 cycle latency, 0.200 per cycle
FSQRT32 | Floating Point Square Root | 14 cycle latency, 0.181 per cycle | 14.5 cycle latency, 0.200 per cycle
FSQRT64 | Floating Point Square Root | 20 cycle latency, 0.111 per cycle | 20.5 cycle latency, 0.105 per cycle
FSQRT80 | Floating Point Square Root | 22 cycle latency, 0.105 per cycle | 22.5 cycle latency, 0.091 per cycle
FCOS 0.739079 | cos X = X | 117 cycle latency, 0.27 per cycle | 149 cycle latency, 0.28 per cycle
The FADD and FMUL improvements mean the most here, but as stated, using x87 is not recommended. So why is it even mentioned here? The answer lies in older software. Software stacks built upon decades-old Fortran still use these instructions, and more often than not in high performance math codes. Increasing throughput for FADD/FMUL should provide a good speed-up there.
Vector Integers
All of the vector integer improvements fall into two main categories. Aside from latency improvements, some of these improvements are execution port specific – due to the way the execution ports have changed this time around, throughput has improved for large numbers of instructions.
Zen3 Updates (5) Port Vector Integer Instructions
Port / Group | Instruction | Vector | Zen 2 | Zen 3
FP013 -> FP0123 | ALU, BLENDI, PCMP, MIN/MAX | MMX, SSE, AVX, AVX2 | 3 per cycle | 4 per cycle
FP2 Non-Variable Shift | PSHIFT | MMX, SSE, AVX, AVX2 | 1 per clock | 2 per clock
FP1 | VPSRLVD/Q, VPSLLVD/Q | AVX2 | 3 cycle latency, 0.5 per clock | 1 cycle latency, 2 per clock
DWORD FP0 | MUL/SAD | MMX, SSE, AVX, AVX2 | 3 cycle latency, 1 per clock | 3 cycle latency, 2 per cycle
DWORD FP0 | PMULLD | SSE, AVX, AVX2 | 4 cycle latency, 0.25 per clock | 3 cycle latency, 2 per clock
WORD FP0 int MUL | PMULHW, PMULHUW, PMULLW | MMX, SSE, AVX, AVX2 | 3 cycle latency, 1 per clock | 3 cycle latency, 0.6 per clock
FP0 int | PMADD, PMADDUBSW | MMX, SSE, AVX, AVX2 | 4 cycle latency, 1 per clock | 3 cycle latency, 2 per clock
FP1 insts | (V)PERMILPS/D, PHMINPOSUW, EXTRQ, INSERTQ | SSE4a | 3 cycle latency, 0.25 per clock | 3 cycle latency, 2 per clock
There are a few others not FP specific.
Zen3 Updates (6) Vector Integer Instructions
Instruction | Operands | Description | Zen 2 | Zen 3
VPBLENDVB | xmm/ymm | Variable Blend Packed Bytes | 1 cycle latency, 1 per cycle | 1 cycle latency, 2 per cycle
VPBROADCAST B/W/D/SS | ymm<-xmm | Load and Broadcast | 4 cycle latency, 1 per cycle | 2 cycle latency, 1 per cycle
VPBROADCAST Q/SD | ymm<-xmm | Load and Broadcast | 1 cycle latency, 1 per cycle | 2 cycle latency, 1 per cycle
VINSERTI128, VINSERTF128 | ymm<-xmm | Insert Packed Values | 1 cycle latency, 1 per cycle | 2 cycle latency, 1 per cycle
SHA1RNDS4 | | Four Rounds of SHA1 | 6 cycle latency, 0.25 per cycle | 6 cycle latency, 0.5 per cycle
SHA1NEXTE | | Calculate SHA1 State | 1 cycle latency, 1 per cycle | 1 cycle latency, 2 per cycle
SHA256RNDS2 | | Two Rounds of SHA256 | 4 cycle latency, 0.5 per cycle | 4 cycle latency, 1 per cycle
These last three are important for SHA cryptography. AMD, unlike Intel, offers hardware-accelerated SHA, so reducing multiple instructions to a single instruction, which helps increase throughput and utilization, should push it even further ahead. Rather than going for hardware-accelerated SHA256, Intel instead prefers to use its AVX-512 unit, which unfortunately is a lot more power hungry and less efficient.
Vector Floats
We’ve already covered the improvements to the FMA latency, but there are also other improvements.
Zen3 Updates (7) Vector Float Instructions
Instruction | Operands | Description | Zen 2 | Zen 3
DIVSS/PS | xmm, ymm | Divide FP32 Scalar/Packed | 10 cycle latency, 0.286 per cycle | 10.5 cycle latency, 0.444 per cycle
DIVSD/PD | xmm, ymm | Divide FP64 Scalar/Packed | 13 cycle latency, 0.200 per cycle | 13.5 cycle latency, 0.235 per cycle
SQRTSS/PS | xmm, ymm | Square Root FP32 Scalar/Packed | 14 cycle latency, 0.181 per cycle | 14.5 cycle latency, 0.273 per cycle
SQRTSD/PD | xmm, ymm | Square Root FP64 Scalar/Packed | 20 cycle latency, 0.111 per cycle | 20.5 cycle latency, 0.118 per cycle
RCPSS/PS | xmm, ymm | Reciprocal FP32 Scalar/Packed | 5 cycle latency, 2 per cycle | 3 cycle latency, 2 per cycle
RSQRTSS/PS | xmm, ymm | Reciprocal FP32 SQRT Scalar/Packed | 5 cycle latency, 2 per cycle | 3 cycle latency, 2 per cycle
VCVT* | xmm<-xmm | Convert | 3 cycle latency, 1 per cycle | 3 cycle latency, 2 per cycle
VCVT* | xmm<-ymm, ymm<-xmm | Convert | 4 cycle latency, 1 per cycle | 4 cycle latency, 2 per cycle
ROUND* | xmm, ymm | Round FP32/FP64 Scalar/Packed | 3 cycle latency, 1 per cycle | 3 cycle latency, 2 per cycle
GATHER | 4x32 | Gather | 19 cycle latency, 0.111 per cycle | 15 cycle latency, 0.250 per cycle
GATHER | 8x32 | Gather | 23 cycle latency, 0.063 per cycle | 19 cycle latency, 0.111 per cycle
GATHER | 4x64 | Gather | 18 cycle latency, 0.167 per cycle | 13 cycle latency, 0.333 per cycle
GATHER | 8x64 | Gather | 19 cycle latency, 0.111 per cycle | 15 cycle latency, 0.250 per cycle
Along with these, store-to-load latencies have increased by a clock. AMD is promoting that it has improved store-to-load bandwidth with the new core, but that comes at additional latency.
Compared to some of the recent CPU launches, this is a lot of changes!
339 Comments
TheinsanegamerN - Tuesday, November 10, 2020 - link
However, AMD's boost algorithm is very temperature sensitive. Those coolers may work fine, but if they get to the 70C range you're losing max performance to higher temperatures.
Andrew LB - Sunday, December 13, 2020 - link
Blah blah....
Ryzen 5800x @ 3.6-4.7ghz : 219w and 82'c.
Ryzen 5800x @ 4.7ghz locked: 231w and 88'c.
Fractal Celsius+ S28 Prisma 280mm AIO CPU cooler at full fan and pump speed
https://www.kitguru.net/components/cpu/luke-hill/a...
If you actually set your voltages on Intel chips they stay cool. My i7-10700k @ 5.0ghz all-core locked never goes above 70'c.
Count Rushmore - Friday, November 6, 2020 - link
It took 3 days... finally the article loaded up. AT seriously needs to upgrade their server (or I need to stop using IE6).
name99 - Friday, November 6, 2020 - link
"AMD wouldn’t exactly detail what this means but we suspect that this could allude to now two branch predictions per cycle instead of just one"
So imagine you have a wide OoO CPU. How do you design fetch? The current state of the art (and presumably AMD have aspects of this, though perhaps not the *entire* package) goes as follows:
Instructions come as runs of sequential instructions separated by branches. At a branch you may HAVE to fetch instructions from a new address (think call, goto, return) or you may perhaps continue to the next address (think non-taken branch).
So an intermediate complexity fetch engine will bring in blobs of instructions, up to N (say 6 or 8), with the run of instructions terminating at
- I've scooped up N or
- I've hit a branch or
- I've hit the end of a cache line.
Basically every cycle should consist of pulling in the longest run of instructions possible subject to the above rules.
The way really advanced fetch works is totally decoupled from the rest of the CPU. Every cycle the fetch engine predicts the next fetch address (from some hierarchy of : check the link stack, check the BTB, increment the PC), and fetches as much as possible from that address. These are stuck in a queue connected to decode, and ideally that queue would never run dry.
BUT: on average there is about a branch every 6 instructions.
Now suppose you want to sustain, let's say, 8-wide. That means that you might set N at 8, but most of the time you'll fetch 6 or so instructions because you'll bail out based on hitting a branch before you have a full 8 instructions in your scoop. So you're mostly unable to go beyond an IPC of 6, even if *everything* else is ideal.
BUT most branches are conditional. And, near enough, half of those are not taken. This means that if you can generate TWO branch predictions per cycle then much of the time the first branch will not be taken, can be ignored, and fetch can continue in a straight line past it. Big win! Half the time you can pull in only 6 instructions, but the other half you could pull in maybe 12 instructions. Basically, if you want to sustain 8 wide, you'd probably want to pull in at least 10 or 12 instructions under best case conditions, to help fill up the queue for the cases where you pull in less than 8 instructions (first branch is taken, or you reach the end of the cache line).
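The fetch argument above can be sketched as a toy simulation. Everything here is an assumption made for illustration: a synthetic trace with a branch roughly every 6 instructions, half of them taken, no cache-line boundaries, and a fetch group that ends at the width cap, at a taken branch, or once the per-cycle prediction budget is used up. It is not a model of any real front end.

```python
import random


def make_trace(n, branch_rate=1 / 6, taken_rate=0.5, seed=0):
    """Synthetic dynamic instruction trace: '-' plain op, 'N'/'T' branch
    that is not taken / taken."""
    rng = random.Random(seed)
    trace = []
    for _ in range(n):
        if rng.random() < branch_rate:
            trace.append("T" if rng.random() < taken_rate else "N")
        else:
            trace.append("-")
    return trace


def avg_fetch_per_cycle(trace, cap, predictions_per_cycle):
    """Average instructions fetched per cycle under the stopping rules."""
    pos = 0
    cycles = 0
    while pos < len(trace):
        fetched = 0
        branches_seen = 0
        while pos < len(trace) and fetched < cap:
            kind = trace[pos]
            pos += 1
            fetched += 1
            if kind != "-":
                branches_seen += 1
                # Stop at a taken branch, or when out of predictions.
                if kind == "T" or branches_seen == predictions_per_cycle:
                    break
        cycles += 1
    return len(trace) / cycles


trace = make_trace(60_000)
one_prediction = avg_fetch_per_cycle(trace, cap=8, predictions_per_cycle=1)
two_predictions = avg_fetch_per_cycle(trace, cap=12, predictions_per_cycle=2)
print(f"1 prediction/cycle, 8 wide:   {one_prediction:.2f} instr/cycle")
print(f"2 predictions/cycle, 12 wide: {two_predictions:.2f} instr/cycle")
```

Even in this crude model the second prediction lets fetch run past not-taken branches, so the average group is noticeably longer than with a single prediction per cycle.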
Now there are some technicalities here.
One is "how does fetch know where the branches are, to know when to stop fetching". This is usually done via pre-decode bits living in the I-cache, and set by a kinda decode when the line is first pulled into the I-cache. (I think x86 also does this, but I have no idea how. It's obviously much easier for a sane ISA like ARM, POWER, even z.)
Second, and more interesting, is that you're actually performing two DIFFERENT TYPES of prediction, which makes it somewhat easier from a bandwidth point of view. The prediction on the first branch is purely "taken/not taken", and all you care about is "not taken"; the prediction on the second branch is more sophisticated because if you predict taken you also have to predict the target, which means dealing with the BTB or link stack.
But you don't have to predict TWO DIFFERENT "next fetch addresses" per cycle, which makes it somewhat easier.
Note also that any CPU that uses two level branch prediction is, I think, already doing two branch predictions per cycle, even if it doesn't look like it. Think about it: how do you USE a large (but slow) second level pool of branch prediction information?
You run the async fetch engine primarily from the first level; and this gives a constant stream of "runs of instructions, separated by branches" with zero delay cycles between runs. Great, zero cycle branches, we all want that. BUT for the predictors to generate a new result in a single cycle they can't be too large.
So you also run a separate engine, delayed a cycle or two, based on the larger pool of second level branch data, checking the predictions of the async engine. If there's a disagreement you flush whatever was fetched past that point (which hopefully is still just in the fetch queue...) and resteer. This will give you a one (or three or four) cycle bubble in the fetch stream, which is not ideal, but
- it doesn't happen that often
- it's a lot better catching a bad prediction very early in fetch, rather than much later in execution
- hopefully the fetch queue is full enough, and filled fast enough, that perhaps it's not even drained by the time decode has walked along it to the point at which the re-steer occurred...
This second (checking) branch prediction doesn't ever get mentioned, but it is there behind the scenes, even when the CPU is ostensibly doing only a single prediction per cycle.
There are other crazy things that happen in modern fetch engines (which are basically in themselves as complicated as a whole CPU from 20 years ago).
One interesting idea is to use the same data that is informing the async fetch engine to inform prefetch. The idea is that you now have essentially two fetch engines running. One is as I described above; the second ONLY cares about the stream of TAKEN branches, and follows that stream as rapidly as possible, ensuring that each line referenced by this stream is being pulled into the I-cache. (You will recognize this as something like a very specialized form of run-ahead.)
In principle this should be perfect -- the I prefetcher and branch-prediction are both trying to solve the *exact* same problem, so pooling their resources should be optimal! In practice, so far this hasn't yet been perfected; the best simulations using this idea are a very few percent behind the best simulations using a different I prefetch technology. But IMHO this is mostly a consequence of this being a fairly new idea that has so far been explored mainly by using pre-existing branch predictors, rather than designing a branch predictor store that's optimal for both tasks.
The main difference is that what matters for prefetching is "far future" branches, branches somewhat beyond where I am now, so that there's plenty of time to pull in the line all the way from RAM. And existing branch predictors have had no incentive to hold onto that sort of far future prediction state. HOWEVER
A second interesting idea is what IBM has been doing for two or three years now. They store branch prediction in what they call an L2 storage but, to avoid confusion, I'll call a cold cache. This is stale/far future branch prediction data that is unused for a while but, on triggering events, that cold cache data will be swapped into the branch prediction storage so that the branch predictors are ready to go for the new context in which they find themselves.
I don't believe IBM use this to drive their I-prefetcher, but obviously it is a great solution to the problem I described above and I suspect this will be where all the performance CPUs eventually find themselves over the next few years. (Apple and IBM probably first, because Apple is Apple, and IBM has the hard part of the solution already in place; then ARM because they're smart and trying hard; then AMD because they're also smart but their technology cycles are slower than ARM; and finally Intel because, well, they're Intel and have been running on fumes for a few years now.)
(Note of course this only solves I-prefetch, which is nice and important; but D-prefetch remains as a difficult and different problem.)
name99 - Friday, November 6, 2020 - link
Oh, one more thing. I referred to "width" of the CPU above. This becomes an ever vaguer term every year. The basic points are two:
- when OoO started, it seemed reasonable to scale every step of the pipeline together. Make the CPU 4-wide. So it can fetch up to 4 instructions/cycle, decode up to 4, issue up to 4, retire up to 4. BUT if you do this you're losing performance every step of the way. Every cycle that fetches only 3 instructions can never make that up; likewise every cycle that only issues 3 instructions.
- so once you have enough transistors available for better designs, you need to ask yourself what's the RATE-LIMITING step? For x86 that's probably in fetch and decode, but let's consider sane ISAs like ARM. There the rate limiting step is probably register rename. So let's assume your max rename bandwidth is 6 instructions/cycle. You actually want to run the rest of your machinery at something like 7 or 8 wide because (by definition) you CAN do so (they are not rate limiting, so they can be grown). And by running them wider you can ensure that the inevitable hiccups along the way are mostly hidden by queues, and your rename machinery is running at full speed, 6-wide each and every cycle, rather than frequently running at 5 or 4 wide because of some unfortunate glitch upstream.
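The rate-limiting-step argument can be illustrated with a toy queue model. The assumptions are entirely made up for the sketch: upstream (fetch/decode) delivers its nominal width minus a random hiccup of 0 to 2 instructions each cycle, a small queue sits in front of rename, and rename can consume at most 6 per cycle.

```python
import random


def rename_throughput(upstream_width, rename_width=6, cycles=10_000,
                      queue_cap=16, seed=0):
    """Average instructions renamed per cycle behind a bounded queue."""
    rng = random.Random(seed)
    queue = 0
    renamed = 0
    for _ in range(cycles):
        # Upstream hiccup: lose 0, 1 or 2 slots this cycle (assumed model).
        supplied = upstream_width - rng.randint(0, 2)
        queue = min(queue_cap, queue + supplied)
        take = min(rename_width, queue)
        queue -= take
        renamed += take
    return renamed / cycles


matched = rename_throughput(upstream_width=6)  # everything 6 wide
wider = rename_throughput(upstream_width=8)    # upstream over-provisioned
print(f"upstream 6 wide: rename sustains {matched:.2f}/cycle")
print(f"upstream 8 wide: rename sustains {wider:.2f}/cycle")
```

With upstream matched to rename, every hiccup is lost throughput; with upstream over-provisioned, the queue absorbs the hiccups and rename stays at its full 6 per cycle in this model.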
Spunjji - Monday, November 9, 2020 - link
These were interesting posts. Thank you!
GeoffreyA - Monday, November 9, 2020 - link
Yes, excellent posts. Thanks.
Touching on width, I was expecting Zen 3 to add another decoder and take it up to 5-wide decode (like Skylake onwards). Zen 3's keeping it at 4 makes good sense though, considering their constraint of not raising power. Another decoder might have raised IPC but would have likely picked up power quite a bit.
ignizkrizalid - Saturday, November 7, 2020 - link
Rip Intel no matter how hard you try squeezing Intel sometimes on top within your graphics! stupid site bias and unreliable if this site was to be truth why not do a live video comparison side by side using 3600 or 4000Mhz ram so we can see the actual numbers and be 100% assured the graphic table is not manipulated in any way, yea I know you will never do it! personally I don't trust these "reviews" that can be manipulated as desired, I respect live video comparison with nothing to hide to the public. Rip Intel Rip Intel.
Spunjji - Monday, November 9, 2020 - link
I... don't think this makes an awful lot of sense, tbh.
MDD1963 - Saturday, November 7, 2020 - link
It would be interesting to also see the various results of the 10900K the way most people actually run them on Z490 boards, i.e., with higher RAM clocks, MCE enabled, etc.; do the equivalent tuning with 5000 series, as I'm sure they will run with faster than DDR4-3200, plus perhaps a small all-core overclock.