The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity

by Dr. Ian Cutress & Andrei Frumusanu on November 4, 2021 9:00 AM EST

Instruction Changes

Both of the processor cores inside Alder Lake are brand new – they build on the previous generation Core and Atom designs in multiple ways. As always, Intel gives us a high level overview of the microarchitecture changes, as we’ve written in an article from Architecture Day:

At the highest level, the P-core supports a 6-wide decode (up from 4), and has split the execution ports to allow for more operations to execute at once, enabling higher IPC and ILP from workloads that can take advantage. Usually a wider decode consumes a lot more power, but Intel says that its micro-op cache (now 4K) and front-end are improved enough that the decode engine spends 80% of its time power gated.

The E-core similarly has a 6-wide decode, although split into 2x3-wide. It has 17 execution ports, buffered by double the load/store support of the previous-generation Atom core. Beyond this, Gracemont is the first Atom core to support AVX2 instructions.

As part of our analysis into new microarchitectures, we also do an instruction sweep to see what other benefits have been added. The following is literally a raw list of changes, which we are still in the process of going through. Please forgive the raw data. Big thanks to our industry friends who help with this analysis.

Any entry below that is listed as A|B means a latency of A clocks and a reciprocal throughput of B clocks per instruction.
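
As a reading aid, the A|B notation can be unpacked mechanically. The following is a minimal Python sketch, not part of the original analysis; the names `Timing`, `parse_pair` and `parse_change` are hypothetical. It splits an entry such as the CLMUL ymm/zmm row (8|2 -> 3|1) into latency and reciprocal throughput, and inverts the latter to get instructions per clock. Rows that are written as a plain "x->y" with a unit are not handled.

```python
from typing import NamedTuple, Tuple


class Timing(NamedTuple):
    latency: float           # clocks until the result is available
    recip_throughput: float  # clocks per instruction when issued back to back

    @property
    def per_clock(self) -> float:
        """Sustained instructions per clock (inverse of reciprocal throughput)."""
        return 1.0 / self.recip_throughput


def parse_pair(text: str) -> Timing:
    """Parse one 'A|B' entry, e.g. '8|2' or '1|.33'."""
    latency, recip = text.strip().split("|")
    return Timing(float(latency), float(recip))


def parse_change(text: str) -> Tuple[Timing, Timing]:
    """Parse 'A|B -> C|D' into (old, new) timings."""
    old_text, new_text = text.split("->")
    return parse_pair(old_text), parse_pair(new_text)


# CLMUL ymm/zmm row from the P-core SIMD list.
old, new = parse_change("8|2 -> 3|1")
print(f"latency {old.latency:g} -> {new.latency:g} clocks")
print(f"throughput {old.per_clock:g} -> {new.per_clock:g} per clock")
```

Keeping latency and reciprocal throughput as separate fields matters because the two can move in opposite directions, as the LEA-with-scale rows in the P-core lists show.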

P-core: Golden Cove vs Cypress Cove

Microarchitecture Changes:

6-wide decoder with a 32-byte window: code size becomes much less important, e.g. 3 MOV imm64 per clock (the last similar 50% jump was Pentium -> Pentium Pro in 1995; Conroe in 2006 was just a 3->4 jump)
Triple load: (almost) universal
- every GPR, SSE, VEX, EVEX load gains (only MMX load unsupported)
- BROADCAST*, GATHER*, PREFETCH* also gain
Decoupled double FADD units
- every single and double SIMD VADD/VSUB (and AVX VADDSUB* and VHADD*/VHSUB*) has latency gains
- Another ADD/SUB means 4->2 clks
- Another MUL means 4->3 clks
- AVX512 support: 512b ADD/SUB rec. throughput 0.5, as in server!
- exception: half precision ADD/SUB handled by FMAs
- exception: x87 FADD remained 3 clks
Some forms of GPR (general purpose register) immediate additions are treated as NOPs (removed at the "allocate/rename/move elimination/zeroing idioms" step)
- LEA r64, [r64+imm8]
- ADD r64, imm8
- ADD r64, imm32
- INC r64
- Is this just for 64b addition GPRs?
Eliminated instructions:
- MOV r32/r64
- (V)MOV(A/U)(PS/PD/DQ) xmm, ymm
- 0-5 0x66 NOP
- LNOP3-7
- CLC/STC
Zeroing idioms:
- (V)XORPS/PD, (V)PXOR xmm, ymm
- (V)PSUB(U)B/W/D/Q xmm
- (V)PCMPGTB/W/D/Q xmm
- (V)PXOR xmm
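
The code-size remark in the decoder entry above can be checked with simple arithmetic. This is a minimal Python sketch under stated assumptions: MOV r64, imm64 is taken to encode in 10 bytes (REX.W prefix, opcode, 8-byte immediate), and the previous core is assumed to have a 16-byte window with a 4-wide decoder. The function name is hypothetical, and the model ignores alignment and the micro-op cache.

```python
def decodable_per_clock(insn_bytes: int, window_bytes: int, decode_width: int) -> int:
    """Upper bound on same-size instructions decoded per clock.

    Limited by whichever runs out first: bytes in the window or decoder
    slots. Alignment effects and the micro-op cache are ignored.
    """
    return min(window_bytes // insn_bytes, decode_width)


MOV_IMM64_BYTES = 10  # assumed encoding: REX.W + opcode + 8-byte immediate

# Assumed previous-generation front end: 16-byte window, 4-wide decode.
old_rate = decodable_per_clock(MOV_IMM64_BYTES, window_bytes=16, decode_width=4)
# Golden Cove as described in the list: 32-byte window, 6-wide decode.
new_rate = decodable_per_clock(MOV_IMM64_BYTES, window_bytes=32, decode_width=6)
print(old_rate, new_rate)

# Relative decode-width jumps mentioned in the same entry.
jump_4_to_6 = (6 - 4) / 4  # Golden Cove
jump_3_to_4 = (4 - 3) / 3  # Conroe
print(f"{jump_4_to_6:.0%} vs {jump_3_to_4:.0%}")
```

Under these assumptions the byte window, not the decoder count, is the limit for long instructions, which is why doubling the window makes code size matter less.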

Faster GPR instructions (vs Cypress Cove):

LOCK latency 20->18 clks
LEA with scale throughput 2->3/clk
(I)MUL r8 latency 4->3 clks
LAHF latency 3->1 clks
CMPS* latency 5->4 clks
REP CMPSB 1->3.7 Bytes/clock
REP SCASB 0.5->1.85 Bytes/clock
REP MOVS* 115->122 Bytes/clock
CMPXCHG16B 20|20 -> 16|14
PREFETCH* throughput 1->3/clk
ANDN/BLSI/BLSMSK/BLSR throughput 2->3/clock
SHA1RNDS4 latency 6->4
SHA1MSG2 throughput 0.2->0.25/clock
SHA256MSG2 11|5->6|2
ADC/SBB (r/e)ax 2|2 -> 1|1

Faster SIMD instructions (vs Cypress Cove):

*FADD xmm/ymm latency 4->3 clks (after MUL)
*FADD xmm/ymm latency 4->2 clks (after ADD)
* means (V)(ADD/SUB/ADDSUB/HADD/HSUB)(PS/PD) affected
V(ADD/SUB)(PS/PD) zmm 4|1->3.3|0.5
CLMUL xmm 6|1->3|1
CLMUL ymm, zmm 8|2->3|1
VPGATHERDQ xmm, [xm32], xmm 22|1.67->20|1.5 clks
VPGATHERDD ymm, [ym32], ymm throughput 0.2 -> 0.33/clock
VPGATHERQQ ymm, [ym64], ymm throughput 0.33 -> 0.50/clock
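
To illustrate why the FADD latency entries above matter, consider a floating-point sum where each add waits on the previous result. The following Python sketch is a rough, purely latency-bound estimate; the function name `reduction_clocks` is hypothetical, the latencies of 4 and 2 clocks are the "after ADD" figures from the list, and port throughput limits and the final combine step are ignored.

```python
import math


def reduction_clocks(n_adds: int, latency: int, accumulators: int = 1) -> int:
    """Latency-bound clock estimate for a sum split over independent chains.

    Each chain performs ceil(n_adds / accumulators) dependent adds, and the
    chains are assumed to overlap fully. Throughput limits are ignored.
    """
    return math.ceil(n_adds / accumulators) * latency


N = 1024
old_single = reduction_clocks(N, latency=4)  # Cypress Cove, one accumulator
new_single = reduction_clocks(N, latency=2)  # Golden Cove, ADD after ADD
old_four = reduction_clocks(N, latency=4, accumulators=4)
new_four = reduction_clocks(N, latency=2, accumulators=4)
print(old_single, new_single, old_four, new_four)
```

In this simplified model, code with a single accumulator benefits directly from the lower latency, while code that was already unrolled over several accumulators is more likely to be bound by throughput instead.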

Regressions, Slower instructions (vs Cypress Cove):

Store-to-Load-Forward 128b 5->7, 256b 6->7 clocks
PAUSE latency 140->160 clocks
LEA with scale latency 2->3 clocks
(I)DIV r8 latency 15->17 clocks
FXCH throughput 2->1/clock
LFENCE latency 6->12 clocks
VBLENDV(B/PS/PD) xmm, ymm 2->3 clocks
(V)AESKEYGEN latency 12->13 clocks
VCVTPS2PH/PH2PS latency 5->6 clocks
BZHI throughput 2->1/clock
VPGATHERDD ymm, [ym32], ymm latency 22->24 clocks
VPGATHERQQ ymm, [ym64], ymm latency 21->23 clocks

E-core: Gracemont vs Tremont

Microarchitecture Changes:

Dual 128b store port (works with every GPR, PUSH, MMX, SSE, AVX, non-temporal m32, m64, m128)
Zen2-like memory renaming with GPRs
New zeroing idioms
- SUB r32, r32
- SUB r64, r64
- CDQ, CQO
- (V)PSUBB/W/D/Q/SB/SW/USB/USW
- (V)PCMPGTB/W/D/Q
New ones idiom: (V)PCMPEQB/W/D/Q
MOV elimination: MOV; MOVZX; MOVSX r32, r64
NOP elimination: NOP, 1-4 0x66 NOP throughput 3->5/clock, LNOP 3, LNOP 4, LNOP 5

Faster GPR instructions (vs Tremont):

PAUSE latency 158->62 clocks
MOVSX; SHL/R r, 1; SHL/R r, imm8 reciprocal throughput 1->0.25
ADD; SUB; CMP; AND; OR; XOR; NEG; NOT; TEST; MOVZX; BSWAP; LEA [r+r]; LEA [r+disp8/32] throughput 3->4 per clock
CMOV* throughput 1->2 per clock
RCR r, 1 10|10 -> 2|2
RCR/RCL r, imm/cl 13|13->11|11
SHLD/SHRD r1_32, r1_32, imm8 2|2 -> 2|0.5
MOVBE latency 1->0.5 clocks
(I)MUL r32 3|1 -> 3|0.5
(I)MUL r64 5|2 -> 5|0.5
REP STOSB/STOSW/STOSD/STOSQ 15/8/12/11 bytes/clock -> 15/15/15/15 bytes/clock
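
The REP STOS figures above translate into fill times as follows. This is a small Python sketch using only the bytes/clock values from the list; the dictionary and function names are hypothetical, the 4096-byte buffer is an arbitrary example, and startup overhead of the REP prefix is ignored.

```python
# Sustained bytes per clock, taken from the REP STOS row above.
TREMONT = {"STOSB": 15, "STOSW": 8, "STOSD": 12, "STOSQ": 11}
GRACEMONT = {"STOSB": 15, "STOSW": 15, "STOSD": 15, "STOSQ": 15}


def fill_clocks(n_bytes: int, bytes_per_clock: float) -> float:
    """Clocks to store n_bytes at a sustained rate (startup cost ignored)."""
    return n_bytes / bytes_per_clock


BUFFER_BYTES = 4096  # arbitrary example size

speedup = {}
for insn, old_rate in TREMONT.items():
    old_clocks = fill_clocks(BUFFER_BYTES, old_rate)
    new_clocks = fill_clocks(BUFFER_BYTES, GRACEMONT[insn])
    speedup[insn] = old_clocks / new_clocks
    print(f"REP {insn}: {old_clocks:.0f} -> {new_clocks:.0f} clocks, x{speedup[insn]:.2f}")
```

On these numbers the element width no longer affects the fill rate on Gracemont, whereas on Tremont the byte variant was the fastest.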

Faster SIMD instructions (vs Tremont):

A lot of xmm SIMD throughput is 4/clock instead of theoretical maximum(?) of 3/clock, not sure how this is possible
MASKMOVQ throughput 1 per 104 clocks -> 1 per clock
PADDB/W/D; PSUBB/W/D; PAVGB/PAVGW 1|0.5 -> 1|0.33
PADDQ/PSUBQ/PCMPEQQ mm, xmm: 2|1 -> 1|.33
PShift (x)mm, (x)mm 2|1 -> 1|.33
PMUL*, PSADBW mm, xmm 4|1 -> 3|1
ADD/SUB/CMP/MAX/MINPS/PD 3|1 -> 3|0.5
MULPS/PD 4|1 -> 4|0.5
CVT*, ROUND xmm, xmm 4|1 -> 3|1
BLENDV* xmm, xmm 3|2 -> 3|0.88
AES, GF2P8AFFINEQB, GF2P8AFFINEINVQB xmm 4|1 -> 3|1
SHA256RNDS2 5|2 -> 4|1
PHADD/PHSUB* 6|6 -> 5|5

Regressions, Slower (vs Tremont):

m8, m16 load latency 4->5 clocks
ADD/MOVBE load latency 4->5 clocks
LOCK ADD 16|16->18|18
XCHG mem 17|17->18|18
(I)DIV +1 clock
DPPS 10|1.5 -> 18|6
DPPD 6|1 -> 10|3.5
FSIN/FCOS 12% slower

474 Comments

mode_13h - Tuesday, November 9, 2021
Well, AMD does have V-Cache and Zen 3+ in the queue. But if you want to short them, be my guest!
Sivar - Monday, November 8, 2021
This is an amazingly deep, properly Anandtech review, even ignoring time constraints and the unusual difficulty of this particular launch.
I bet Ian and Andrei will be catching up on sleep for weeks.
xhris4747 - Tuesday, November 9, 2021
Hi
ricebunny - Tuesday, November 9, 2021
It’s disappointing that Anandtech continues to use suboptimal compilers for their platforms. Intel’s Compiler classic demonstrated 41% better performance than Clang 12.0.0 in the SPECrate 2017 Floating Point suite.
mode_13h - Wednesday, November 10, 2021
I think it's fair, though. Most workloads people run aren't built with vendor-supplied compilers; they use the industry standards of gcc, clang, or msvc. And the point of benchmarks is to give you an idea of what the typical user experience would be.
ricebunny - Wednesday, November 10, 2021
But are they not compiling the code for the M1 series chips with a vendor supplied compiler?

Second, almost all benchmarks in SPECrate 2017 Floating Point are scientific codes, half of which are in Fortran. That’s exactly the target domain of the Intel compiler. I admit, I am out of date with the HPC developments, but back when I was still in the game icc was the most commonly used compiler.
mode_13h - Thursday, November 11, 2021
> are they not compiling the code for the M1 series chips with a vendor supplied compiler?

It's just a slightly newer version of LLVM than what you'd get on Linux.

> almost all benchmarks in SPECrate 2017 Floating Point are scientific codes,

3 are rendering, animation, and image processing. Some of the others could fall more in the category of engineering than scientific, but whatever.

> half of which are in Fortran.

Only 3 are pure fortran. Another 4 are some mixture, but we don't know the relative amounts. They could literally link in BLAS or some FFT code for some trivial setup computation, and that would count as including fortran.

https://www.spec.org/cpu2017/Docs/index.html#intra...

BTW, you conveniently ignored how only one of the SPECrate 2017 int tests is fortran.
mode_13h - Thursday, November 11, 2021
Oops, I accidentally counted one test that's only SPECspeed.

So, in SPECrate 2017 fp:

3 are fortran
3 are fortran & C/C++
7 are only C/C++
ricebunny - Thursday, November 11, 2021
Yes, I made the same mistake when counting.

Without knowing what the Fortran code in the mixed code represents I would not discard it as irrelevant: those tests could very well spend a majority of their time executing Fortran.

As for the int tests, the advantage of the Intel compiler was even more pronounced: almost 50% over Clang. IMO this is too significant to ignore.

If I ran these tests, I would provide results from multiple compilers. I would also consult with the CPU vendors regarding the recommended compiler settings. Anandtech refuses to compile code with AVX512 support for non Alder Lake Intel chips, whereas Intel’s runs of SPECrate2017 enable that switch?
xray9 - Sunday, November 14, 2021
> At Intel’s Innovation event last week, we learned that the operating system
> will de-emphasise any workload that is not in user focus.

I see this as performance-critical for audio applications, which need near-real-time performance.
It's already a pain to find good working drivers that do not occupy a CPU core for too long, so as not to block processes with near-real-time demands.
And for performance tuning we already use the Windows option to prioritize background processes, which gives the process scheduler a higher, fixed time quantum, so it can work more efficiently on processes and lower the number of context switches.
And now we get this hybrid design where everything becomes out of control and you can only hope and pray that the process scheduling will not be too bad. I am not amused by that, and very skeptical that this will work out well.
