AMD Zen 3 Ryzen Deep Dive Review: 5950X, 5900X, 5800X and 5600X Tested
by Dr. Ian Cutress on November 5, 2020 9:01 AM EST
New and Improved Instructions
When it comes to instruction improvements, moving to a brand new ground-up core enables a lot more flexibility in how instructions are processed compared to just a core update. Aside from adding new security functionality, being able to rearchitect the decoder/micro-op cache, the execution units, and the number of execution units allows for a variety of new features and hopefully faster throughput.
As part of the microarchitecture deep-dive disclosures from AMD, we naturally get AMD’s messaging on the improvements in this area: we were told of the highlights, such as the improved FMAC and new AVX2/AVX256 expansions. There’s also Control-Flow Enforcement Technology (CET), which enables a shadow stack to protect against ret/ROP attacks. However, after getting our hands on the chip, there’s a trove of improvements to dive through.
Let’s cover AMD’s own highlights first.
The top cover item is the improved Fused Multiply-Accumulate (FMA), which is a frequently used operation in a number of high-performance compute workloads as well as machine learning, neural networks, scientific compute and enterprise workloads.
In Zen 2, a single FMA took 5 cycles with a throughput of 2/clock.
In Zen 3, a single FMA takes 4 cycles with a throughput of 2/clock.
This means that AMD’s FMAs are now on parity with Intel; however, this update is going to be most used in AMD’s EPYC processors. As we scale up this improvement to the 64 cores of the current generation EPYC Rome, any compute-limited workload on Rome should be freed in Milan. Combine that with the larger L3 cache and improved load/store, and some workloads should expect some good speed-ups.
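To make the latency change concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a deliberately simple model (our assumption, not AMD's): a chain of dependent FMAs costs one full latency per operation, and the number of independent chains needed to keep both FMA pipes busy is latency multiplied by throughput. The function names are ours, and loads, stores and the front end are ignored.

```python
# (latency in cycles, FMAs issued per clock), as quoted above
ZEN2_FMA = (5, 2)
ZEN3_FMA = (4, 2)

def dependent_chain_cycles(n_ops, latency):
    """Cycles for n_ops FMAs where each one waits on the previous result."""
    return n_ops * latency

def chains_to_saturate(latency, per_clock):
    """Independent accumulator chains needed to keep every FMA pipe busy."""
    return latency * per_clock

for name, (latency, per_clock) in (("Zen 2", ZEN2_FMA), ("Zen 3", ZEN3_FMA)):
    serial = dependent_chain_cycles(1000, latency)
    chains = chains_to_saturate(latency, per_clock)
    print(f"{name}: 1000 dependent FMAs ~{serial} cycles, "
          f"{chains} independent chains to saturate")
```

Under this model the peak rate is unchanged, but a serial accumulation finishes in 20% fewer cycles, and code needs eight rather than ten independent accumulators to reach that peak.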
The other main update is with cryptography and cyphers. In Zen 2, vector-based AES and PCLMULQDQ operations were limited to AVX / 128-bit execution, whereas in Zen 3 they are upgraded to AVX2 / 256-bit execution.
This means that VAES has a latency of 4 cycles with a throughput of 2/clock.
This means that VPCLMULQDQ has a latency of 4 cycles, with a throughput of 0.5/clock.
AMD also mentioned to a certain extent that it has increased its ability to process repeated MOV instructions on short strings – what used to not be so good for short copies is now good for both small and large copies. We detected that the new core performs better REP MOV instruction elimination at the decode stage, leveraging the micro-op cache better.
Now here’s the stuff that AMD didn’t talk about.
Integer
Sticking with instruction elimination, a lot of instructions and zeroing idioms that Zen 2 used to decode but then skip execution are now detected and eliminated at the decode stage.
- NOP (90h) up to 5x 66h
- LNOP3/4/5 (Looped NOP)
- (V)MOVAPS/MOVAPD/MOVUPS/MOVUPD vec1, vec1 : Move (Un)Aligned Packed FP32/FP64
- VANDNPS/VANDNPD vec1, vec1, vec1 : Vector bitwise logical AND NOT Packed FP32/FP64
- VXORPS/VXORPD vec1, vec1, vec1 : Vector bitwise logical XOR Packed FP32/FP64
- VPANDN/VPXOR vec1, vec1, vec1 : Vector bitwise logical (AND NOT)/XOR
- VPCMPGTB/W/D/Q vec1, vec1, vec1 : Vector compare packed integers greater than
- VPSUBB/W/D/Q vec1, vec1, vec1 : Vector subtract packed integers
- VZEROUPPER : Zero upper bits of YMM
- CLC : Clear Carry Flag
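Most of the vector entries in this list share one property: when every operand is the same register, the result is zero whatever the register held, so no execution unit is needed to work it out. The sketch below models a single 32-bit lane in Python to show that property. The lane width, helper names and sample values are our own illustration and not a description of AMD's decode logic.

```python
MASK32 = 0xFFFFFFFF  # model a single 32-bit lane

def to_signed(x):
    """Reinterpret a 32-bit lane as a signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def xor(a, b):    # VXORPS / VPXOR
    return (a ^ b) & MASK32

def andn(a, b):   # VANDNPS / VPANDN: (NOT a) AND b
    return (~a & b) & MASK32

def sub(a, b):    # VPSUBD
    return (a - b) & MASK32

def cmpgt(a, b):  # VPCMPGTD: all ones if a > b (signed), else zero
    return MASK32 if to_signed(a) > to_signed(b) else 0

IDIOMS = {"VPXOR/VXORPS": xor, "VPANDN/VANDNPS": andn,
          "VPSUBD": sub, "VPCMPGTD": cmpgt}
SAMPLES = [0, 1, 0x7FFFFFFF, 0x80000000, 0xDEADBEEF, MASK32]

def is_zeroing(op):
    """True if op(x, x) is zero for every sample value."""
    return all(op(x, x) == 0 for x in SAMPLES)

for name, op in IDIOMS.items():
    print(f"{name} vec1, vec1, vec1 -> always zero: {is_zeroing(op)}")
```

The MOV forms in the list are a different case: moving a register onto itself leaves it unchanged rather than zeroing it.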
As for direct performance adjustments, we detected the following:
Zen3 Updates (1) Integer Instructions
| AnandTech | Instruction | Zen 2 | Zen 3 |
|---|---|---|---|
| XCHG | Exchange Register/Memory with Register | 17 cycle latency | 7 cycle latency |
| LOCK (ALU) | Assert LOCK# Signal | 17 cycle latency | 7 cycle latency |
| ALU r16/r32/r64 imm | ALU on constant | 2.4 per cycle | 4 per cycle |
| SHLD/SHRD | Double Precision Shift Left/Right | 4 cycle latency, 0.33 per cycle | 2 cycle latency, 0.66 per cycle |
| LEA [r+r*i] | Load Effective Address | 2 cycle latency, 2 per cycle | 1 cycle latency, 4 per cycle |
| IDIV r8 | Signed Integer Division | 16 cycle latency, 1/16 per cycle | 10 cycle latency, 1/10 per cycle |
| DIV r8 | Unsigned Integer Division | 17 cycle latency, 1/17 per cycle | same as IDIV r8 |
| IDIV r16 | Signed Integer Division | 21 cycle latency, 1/21 per cycle | 12 cycle latency, 1/12 per cycle |
| DIV r16 | Unsigned Integer Division | 22 cycle latency, 1/22 per cycle | same as IDIV r16 |
| IDIV r32 | Signed Integer Division | 29 cycle latency, 1/29 per cycle | 14 cycle latency, 1/14 per cycle |
| DIV r32 | Unsigned Integer Division | 30 cycle latency, 1/30 per cycle | same as IDIV r32 |
| IDIV r64 | Signed Integer Division | 45 cycle latency, 1/45 per cycle | 19 cycle latency, 1/19 per cycle |
| DIV r64 | Unsigned Integer Division | 46 cycle latency, 1/46 per cycle | 20 cycle latency, 1/20 per cycle |
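To put the division rows in perspective, the short Python sketch below turns the signed division latencies from the table into speed-up ratios. The figures are copied from the table; only the helper name is ours.

```python
# operand width -> (Zen 2 latency, Zen 3 latency) in cycles, from the table
IDIV_LATENCY = {
    "r8": (16, 10),
    "r16": (21, 12),
    "r32": (29, 14),
    "r64": (45, 19),
}

def speedup(zen2_cycles, zen3_cycles):
    """How many times fewer cycles Zen 3 needs than Zen 2."""
    return zen2_cycles / zen3_cycles

for width, (zen2, zen3) in IDIV_LATENCY.items():
    print(f"IDIV {width}: {zen2} -> {zen3} cycles "
          f"({speedup(zen2, zen3):.2f}x)")
```

The gain grows with operand width. Because each throughput figure in the table is the reciprocal of the latency, the divider looks unpipelined on both cores, so the same ratios would apply to back-to-back divisions.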
Zen3 Updates (2) Integer Instructions
| AnandTech | Instruction | Zen 2 | Zen 3 |
|---|---|---|---|
| LAHF | Load Status Flags into AH Register | 2 cycle latency, 0.5 per cycle | 1 cycle latency, 1 per cycle |
| PUSH reg | Push Register Onto Stack | 1 per cycle | 2 per cycle |
| POP reg | Pop Value from Stack Into Register | 2 per cycle | 3 per cycle |
| POPCNT | Count Bits | 3 per cycle | 4 per cycle |
| LZCNT | Count Leading Zero Bits | 3 per cycle | 4 per cycle |
| ANDN | Logical AND NOT | 3 per cycle | 4 per cycle |
| PREFETCH* | Prefetch | 2 per cycle | 3 per cycle |
| PDEP/PEXT | Parallel Bits Deposit/Extract | 300 cycle latency, 250 cycles per 1 | 3 cycle latency, 1 per clock |
It’s worth highlighting those last two rows. Software that helps the prefetchers, due to how AMD has arranged the branch predictors, can now process three prefetch commands per cycle. The other element is the introduction of hardware acceleration for the parallel bits instructions, PDEP and PEXT: latency is reduced by 99% and throughput is up 250x. If anyone asks why we ever need extra transistors for modern CPUs, it’s for things like this.
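For readers unfamiliar with the two parallel bits instructions, here is a minimal Python sketch of what PEXT and PDEP compute. It models the architectural behaviour only: the function names and the bit-by-bit loop are our own illustration and say nothing about how either core implements the operation internally.

```python
def pext(src, mask):
    """Parallel bits extract: pack the bits of src selected by mask
    into the low bits of the result."""
    result = 0
    out_pos = 0
    while mask:
        lowest = mask & -mask        # isolate the lowest set bit of mask
        if src & lowest:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1             # clear that bit and move on
    return result

def pdep(src, mask):
    """Parallel bits deposit: scatter the low bits of src to the
    positions of the set bits in mask."""
    result = 0
    in_pos = 0
    while mask:
        lowest = mask & -mask
        if (src >> in_pos) & 1:
            result |= lowest
        in_pos += 1
        mask &= mask - 1
    return result

value, mask = 0b10110010, 0b11110000
packed = pext(value, mask)
print(bin(packed), bin(pdep(packed, mask)))
```

A loop like this visits one mask bit per iteration, which gives a feel for why an implementation without dedicated hardware can cost hundreds of cycles on a 64-bit mask, while a dedicated unit returns the result in three.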
There are also some regressions:
Zen3 Updates (3) Slower Instructions
| AnandTech | Instruction | Zen 2 | Zen 3 |
|---|---|---|---|
| CMPXCHG8B | Compare and Exchange 8 Byte/64-bit | 9 cycle latency, 0.167 per cycle | 11 cycle latency, 0.167 per cycle |
| BEXTR | Bit Field Extract | 3 per cycle | 2 per cycle |
| BZHI | Zero High Bit with Position | 3 per cycle | 2 per cycle |
| RORX | Rotate Right Logical Without Flags | 3 per cycle | 2 per cycle |
| SHLX / SHRX | Shift Left/Right Without Flags | 3 per cycle | 2 per cycle |
As always, there are trade-offs.
x87
For anyone using older mathematics software, it might be riddled with a lot of x87 code. x87 was originally meant to be an extension of x86 for floating point operations, but based on other improvements to the instruction set, x87 is somewhat deprecated, and we often see regressed performance generation on generation.
But not on Zen 3. Among the regressions, we’re also seeing some improvements. Some.
Zen3 Updates (4) x87 Instructions
| AnandTech | Instruction | Zen 2 | Zen 3 |
|---|---|---|---|
| FXCH | Exchange Registers | 2 per cycle | 4 per cycle |
| FADD | Floating Point Add | 5 cycle latency, 1 per cycle | 6.5 cycle latency, 2 per cycle |
| FMUL | Floating Point Multiply | 5 cycle latency, 1 per cycle | 6.5 cycle latency, 2 per cycle |
| FDIV32 | Floating Point Division | 10 cycle latency, 0.285 per cycle | 10.5 cycle latency, 0.800 per cycle |
| FDIV64 | Floating Point Division | 13 cycle latency, 0.200 per cycle | 13.5 cycle latency, 0.235 per cycle |
| FDIV80 | Floating Point Division | 15 cycle latency, 0.167 per cycle | 15.5 cycle latency, 0.200 per cycle |
| FSQRT32 | Floating Point Square Root | 14 cycle latency, 0.181 per cycle | 14.5 cycle latency, 0.200 per cycle |
| FSQRT64 | Floating Point Square Root | 20 cycle latency, 0.111 per cycle | 20.5 cycle latency, 0.105 per cycle |
| FSQRT80 | Floating Point Square Root | 22 cycle latency, 0.105 per cycle | 22.5 cycle latency, 0.091 per cycle |
| FCOS 0.739079 | cos X = X | 117 cycle latency, 0.27 per cycle | 149 cycle latency, 0.28 per cycle |
The FADD and FMUL improvements mean the most here, but as stated, using x87 is not recommended. So why is it even mentioned here? The answer lies in older software. Software stacks built upon decades-old Fortran still use these instructions, and more often than not in high performance math codes. Increasing throughput for FADD/FMUL should provide a good speed-up there.
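Note that FADD and FMUL trade a longer latency for doubled throughput, so the benefit depends on how much independent work the code exposes. The Python sketch below uses a rough model of our own: run time is the larger of a latency bound and a throughput bound, and everything else (loads, x87 stack shuffling, clock speed) is ignored. The latency and throughput figures come from the table above.

```python
# core -> (FADD latency in cycles, FADDs per cycle), from the x87 table
X87_FADD = {"Zen 2": (5.0, 1.0), "Zen 3": (6.5, 2.0)}

def cycles(n_ops, chains, latency, per_cycle):
    """Estimated cycles for n_ops adds split over independent chains."""
    latency_bound = n_ops * latency / chains   # each chain runs serially
    throughput_bound = n_ops / per_cycle       # issue limit of the FP pipes
    return max(latency_bound, throughput_bound)

for chains in (1, 4, 8):
    row = ", ".join(
        f"{core}: {cycles(1000, chains, lat, tput):.1f}"
        for core, (lat, tput) in X87_FADD.items()
    )
    print(f"{chains} independent chain(s), 1000 adds -> {row} cycles")
```

In this simplified model a single serial chain is bound by the longer latency, and the doubled throughput only shows up once many independent operations are in flight, so how much of the gain legacy code sees will vary with how it is written.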
Vector Integers
All of the vector integer improvements fall into two main categories. Aside from latency improvements, some of these improvements are execution port specific – due to the way the execution ports have changed this time around, throughput has improved for large numbers of instructions.
Zen3 Updates (5) Port Vector Integer Instructions
| AnandTech | Instruction | Vector | Zen 2 | Zen 3 |
|---|---|---|---|---|
| FP013 -> FP0123 | ALU, BLENDI, PCMP, MIN/MAX | MMX, SSE, AVX, AVX2 | 3 per cycle | 4 per cycle |
| FP2 Non-Variable Shift | PSHIFT | MMX, SSE, AVX, AVX2 | 1 per clock | 2 per clock |
| FP1 | VPSRLVD/Q, VPSLLVD/Q | AVX2 | 3 cycle latency, 0.5 per clock | 1 cycle latency, 2 per clock |
| DWORD FP0 | MUL/SAD | MMX, SSE, AVX, AVX2 | 3 cycle latency, 1 per clock | 3 cycle latency, 2 per cycle |
| DWORD FP0 | PMULLD | SSE, AVX, AVX2 | 4 cycle latency, 0.25 per clock | 3 cycle latency, 2 per clock |
| WORD FP0 int MUL | PMULHW, PMULHUW, PMULLW | MMX, SSE, AVX, AVX2 | 3 cycle latency, 1 per clock | 3 cycle latency, 0.6 per clock |
| FP0 int | PMADD, PMADDUBSW | MMX, SSE, AVX, AVX2 | 4 cycle latency, 1 per clock | 3 cycle latency, 2 per clock |
| FP1 insts | (V)PERMILPS/D, PHMINPOSUW, EXTRQ, INSERTQ | SSE4a | 3 cycle latency, 0.25 per clock | 3 cycle latency, 2 per clock |
There are a few others not FP specific.
Zen3 Updates (6) Vector Integer Instructions
| AnandTech | Operands | Instruction | Zen 2 | Zen 3 |
|---|---|---|---|---|
| VPBLENDVB | xmm/ymm | Variable Blend Packed Bytes | 1 cycle latency, 1 per cycle | 1 cycle latency, 2 per cycle |
| VPBROADCAST B/W/D/SS | ymm<-xmm | Load and Broadcast | 4 cycle latency, 1 per cycle | 2 cycle latency, 1 per cycle |
| VPBROADCAST Q/SD | ymm<-xmm | Load and Broadcast | 1 cycle latency, 1 per cycle | 2 cycle latency, 1 per cycle |
| VINSERTI128, VINSERTF128 | ymm<-xmm | Insert Packed Values | 1 cycle latency, 1 per cycle | 2 cycle latency, 1 per cycle |
| SHA1RNDS4 | | Four Rounds of SHA1 | 6 cycle latency, 0.25 per cycle | 6 cycle latency, 0.5 per cycle |
| SHA1NEXTE | | Calculate SHA1 State | 1 cycle latency, 1 per cycle | 1 cycle latency, 2 per cycle |
| SHA256RNDS2 | | Two Rounds of SHA256 | 4 cycle latency, 0.5 per cycle | 4 cycle latency, 1 per cycle |
These last three are important for SHA cryptography. AMD, unlike Intel, does accelerated SHA, so being able to reduce multiple instructions to a single instruction to help increase throughput and utilization should push them even further ahead. Rather than going for hardware accelerated SHA256, Intel instead prefers to use its AVX-512 unit, which unfortunately is a lot more power hungry and less efficient.
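As a rough illustration of what the doubled round throughput could mean, the Python sketch below estimates a lower bound on cycles per 64-byte block from the round instructions alone. It relies on the standard round counts (80 for SHA-1, 64 for SHA-256) and on SHA1RNDS4 and SHA256RNDS2 performing four and two rounds respectively. It ignores the message schedule instructions and everything else, and the helper name is ours.

```python
BLOCK_BYTES = 64  # SHA-1 and SHA-256 both hash 64-byte blocks

# name -> (rounds per block, rounds per instruction,
#          Zen 2 instructions per cycle, Zen 3 instructions per cycle)
ROUND_INSTRUCTIONS = {
    "SHA1RNDS4": (80, 4, 0.25, 0.5),
    "SHA256RNDS2": (64, 2, 0.5, 1.0),
}

def min_cycles_per_block(rounds, rounds_per_insn, per_cycle):
    """Throughput-only floor: round instructions needed / issue rate."""
    return (rounds // rounds_per_insn) / per_cycle

for name, (rounds, per_insn, zen2, zen3) in ROUND_INSTRUCTIONS.items():
    old = min_cycles_per_block(rounds, per_insn, zen2)
    new = min_cycles_per_block(rounds, per_insn, zen3)
    print(f"{name}: {old:.0f} -> {new:.0f} cycles per block "
          f"({BLOCK_BYTES / old:.1f} -> {BLOCK_BYTES / new:.1f} bytes/cycle)")
```

This is a throughput floor only. The rounds of a single message depend on each other, so one hash stream is bound by instruction latency instead, and the floor is most relevant when several independent hashes are interleaved.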
Vector Floats
We’ve already covered the improvements to the FMA latency, but there are also other improvements.
Zen3 Updates (7) Vector Float Instructions
| AnandTech | Operands | Instruction | Zen 2 | Zen 3 |
|---|---|---|---|---|
| DIVSS/PS | xmm, ymm | Divide FP32 Scalar/Packed | 10 cycle latency, 0.286 per cycle | 10.5 cycle latency, 0.444 per cycle |
| DIVSD/PD | xmm, ymm | Divide FP64 Scalar/Packed | 13 cycle latency, 0.200 per cycle | 13.5 cycle latency, 0.235 per cycle |
| SQRTSS/PS | xmm, ymm | Square Root FP32 Scalar/Packed | 14 cycle latency, 0.181 per cycle | 14.5 cycle latency, 0.273 per cycle |
| SQRTSD/PD | xmm, ymm | Square Root FP64 Scalar/Packed | 20 cycle latency, 0.111 per cycle | 20.5 cycle latency, 0.118 per cycle |
| RCPSS/PS | xmm, ymm | Reciprocal FP32 Scalar/Packed | 5 cycle latency, 2 per cycle | 3 cycle latency, 2 per cycle |
| RSQRTSS/PS | xmm, ymm | Reciprocal FP32 SQRT Scalar/Packed | 5 cycle latency, 2 per cycle | 3 cycle latency, 2 per cycle |
| VCVT* | xmm<-xmm | Convert | 3 cycle latency, 1 per cycle | 3 cycle latency, 2 per cycle |
| VCVT* | xmm<-ymm, ymm<-xmm | Convert | 4 cycle latency, 1 per cycle | 4 cycle latency, 2 per cycle |
| ROUND* | xmm, ymm | Round FP32/FP64 Scalar/Packed | 3 cycle latency, 1 per cycle | 3 cycle latency, 2 per cycle |
| GATHER | 4x32 | Gather | 19 cycle latency, 0.111 per cycle | 15 cycle latency, 0.250 per cycle |
| GATHER | 8x32 | Gather | 23 cycle latency, 0.063 per cycle | 19 cycle latency, 0.111 per cycle |
| GATHER | 4x64 | Gather | 18 cycle latency, 0.167 per cycle | 13 cycle latency, 0.333 per cycle |
| GATHER | 8x64 | Gather | 19 cycle latency, 0.111 per cycle | 15 cycle latency, 0.250 per cycle |
Along with these, store-to-load latencies have increased by a clock. AMD is promoting that it has improved store-to-load bandwidth with the new core, but that comes at additional latency.
Compared to some of the recent CPU launches, this is a lot of changes!
339 Comments
halcyon - Tuesday, November 10, 2020 - link
1. Ryzen 9 5xxx series dominate most gaming benchmarks in CPU bound games up to 720p
2. However at 1440P/4K Intel, esp. 10850K pull ahead.
Can somebody explain this anomaly? As Games become more GPU bound at higher res, why does Intel pull ahead (with worse single/multi-thread CPU perf)? Is it a bandwidth/latency issue? If so, where exactly (RAM? L3? somewhere else)? Can't be PCIe, can it?
feka1ity - Saturday, November 14, 2020 - link
RAM. anandtech uses shitty ram for intel systems
Makste - Monday, November 16, 2020 - link
I think the game optimizations for intel processors become clear at those resolutions. AMD has been a none factor in gaming for so long. These games have been developed on and mostly optimised to work better on intel machines
Silma - Wednesday, November 11, 2020 - link
At 4K, the 3700X beats the 5600X quite often.
Samus - Friday, November 13, 2020 - link
Considering Intel just released a new generation of CPU's, it's astonishing at their current IPC generation-over-generation trajectory, it will take them two more generations to surpass Zen 3. That's almost 2 years.
Wow.
ssshenoy - Tuesday, December 15, 2020 - link
I dont think this article compares the latest generation from Intel - the Willow Cove core in Tiger lake which is launched only for notebooks. The comparison here seems to be with the ancient Skylake generation on 14 nm.
abufrejoval - Friday, November 13, 2020 - link
Got my Ryzen 7 5800X on a new Aorus X570 mainboard and finally working, too.
It turbos to 4850MHz without any overclocking, so I'd hazard 150MHz "bonus" are pretty much the default across the line.
At the wall plug 210 Watts was the biggest load I observed for pure CPU loads. HWinfo never reporting anything in excess of 120 Watts on the CPU from internal sensors.
"finally working": I want ECC with this rig, because I am aiming for 64GB or even 128GB RAM and 24x7 operation. Ordered DDR4-3200 ECC modules from Kingston to go with the board. Those seem a little slow coming so I tried to make do with pilfering some DIMMs from other systems, that could be shut down for a moment. DDR4-2133 ECC and DDR4-2400 ECC modules were candidates, but wouldn't boot...
Both were 2Rx4, dual rank, nibble not byte organized modules, unbuffered and unregistered but not the byte organized DIMMs that the Gigabyte documentation seemed to prescribe... Asus, MSI and ASrock don't list such constraints, but I had to go with availability...
I like to think of RAM as RAM, it may be slower or faster, but it shouldn't be tied to one specific system, right?
So while I await the DDR4-3200 ECC 32GB modules to arrive, I got myself some DDR4-4000 R1x8 (no ECC, 8GB) DIMMs to fill the gap: But would that X570 mainboard, which might have been laying on shelves for months actually boot a Ryzen 5000?
No, it wouldn't.
But yes, it would update the BIOS via Q-Flash Plus-what-shall-we-call-it and then, yes, it did indeed recognize both the CPU and those R1x8 DIMMs just fine after the update.
I haven't yet tried those R2x4 modules again, because I am still exploring the bandwidth high-end, but I want to report just how much I am impressed by the compatibility of the AM4 platform, fully aware that Zen 3 will be the last generation in this "sprint".
I vividly remember how I had to get Skylake CPUs in order to get various mainboard ready for Kaby Lake...
I have been using AMD x86 CPUs from 80486DX4. I owned every iteration of K6-II and K6-III, omitted all Slot-A variants, got back with socket-A, 754, 939, went single, quad, and hexa (Phenom II x4+x6), omitted Bulldozer, but did almost every APU but between Kaveri and Zen 3, AMD simply wasn't compelling enough.
I would have gotten a Ryzen 9 5950x, if it had been available. But I count myself lucky for the moment to have snatched a Ryzen 7 5800X: It sure doesn't disappoint.
AMD a toast! You have done very well indeed and you can count me impressed!
Of course I'll nag about missing SVE/MKTME support day after tomorrow, but in the mean-time, please accept my gratitude.
feka1ity - Saturday, November 14, 2020 - link
Interesting, my default 9700k with 1080ti does 225fps avg - Borderlands 3, 360p, very low settings and anantech testers poop 175fps avg with 10900k and 2080ti?!? And this favoritize amede products. Fake stuff, sorry.
Spunjji - Monday, November 16, 2020 - link
"Fake stuff"
Thanks for labelling your post
feka1ity - Monday, November 16, 2020 - link
Fake stuff is not a label, it's a epicrisis. Go render stuff, spunji