Instruction Changes

Both of the processor cores inside Alder Lake are brand new, building on the previous generation Core and Atom designs in multiple ways. As always, Intel has given us a high-level overview of the microarchitecture changes, which we covered in our Architecture Day article:

At the highest level, the P-core supports a 6-wide decode (up from 4), and has split the execution ports to allow for more operations to execute at once, enabling higher IPC and ILP from workloads that can take advantage of it. A wider decode usually consumes a lot more power, but Intel says that its micro-op cache (now 4K entries) and front-end are improved enough that the decode engine spends 80% of its time power gated.

The E-core similarly has a 6-wide decode, although split as a 2x3-wide arrangement. It has 17 execution ports, backed by double the load/store support of the previous generation Atom core. Beyond this, Gracemont is the first Atom core to support AVX2 instructions.
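
To make that last point concrete, here is a minimal sketch (our own illustrative code, not Intel's) of the common runtime-dispatch pattern: on earlier Atom cores the AVX2 branch was never taken, whereas on Alder Lake both core types can take it, so hybrid systems can run a single AVX2 path. Function and variable names are ours; GCC/Clang builtins assumed.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: dispatch between an AVX2 path and a scalar fallback at runtime.
 * Pre-Gracemont Atom cores never hit the AVX2 branch; Alder Lake's E-cores now do. */
__attribute__((target("avx2")))
static void scale_avx2(float *v, float s, size_t n)
{
    size_t i = 0;
    __m256 vs = _mm256_set1_ps(s);
    for (; i + 8 <= n; i += 8)                       /* 8 floats per iteration */
        _mm256_storeu_ps(v + i, _mm256_mul_ps(_mm256_loadu_ps(v + i), vs));
    for (; i < n; i++) v[i] *= s;                    /* scalar tail */
}

static void scale_scalar(float *v, float s, size_t n)
{
    for (size_t i = 0; i < n; i++) v[i] *= s;
}

void scale(float *v, float s, size_t n)
{
    if (__builtin_cpu_supports("avx2"))              /* true on Gracemont as well now */
        scale_avx2(v, s, n);
    else
        scale_scalar(v, s, n);
}
```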

As part of our analysis of new microarchitectures, we also do an instruction sweep to see what other benefits have been added. The following is essentially a raw list of changes, which we are still in the process of going through. Please forgive the raw data. Big thanks to our industry friends who help with this analysis.

Anything listed below as A|B means a latency of A clocks and a reciprocal throughput of B clocks per instruction.
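
As a worked illustration of the difference: a chain of dependent operations runs at the instruction's latency, while independent operations run at its reciprocal throughput. The following is a minimal micro-benchmark sketch of our own (using GCC/Clang's __rdtsc, which counts reference cycles, so treat the numbers as indicative only and compile at a modest optimization level so the loops are kept as written):

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>      /* __rdtsc() on GCC/Clang */

#define N 100000000ULL

int main(void)
{
    uint64_t x = 1, a = 1, b = 1, c = 1;

    uint64_t t0 = __rdtsc();
    for (uint64_t i = 0; i < N; i++)
        x = x * 3 + 1;                  /* each iteration waits on the previous: latency-bound */
    uint64_t t1 = __rdtsc();

    for (uint64_t i = 0; i < N; i++) {  /* three independent chains: closer to throughput-bound */
        a = a * 3 + 1;
        b = b * 3 + 1;
        c = c * 3 + 1;
    }
    uint64_t t2 = __rdtsc();

    printf("dependent chain:    %.2f ref-cycles per op\n", (double)(t1 - t0) / N);
    printf("independent chains: %.2f ref-cycles per op\n", (double)(t2 - t1) / (3.0 * N));
    return (int)(x + a + b + c);        /* keep the results live */
}
```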

 

P-core: Golden Cove vs Cypress Cove

Microarchitecture Changes:

  • 6-wide decoder with a 32-byte window: code size matters much less, e.g. 3x MOV imm64 per clock (the last similar 50% jump was Pentium -> Pentium Pro in 1995; Conroe in 2006 was only a 3->4 jump)
  • Triple load: (almost) universal
    • every GPR, SSE, VEX, and EVEX load benefits (only MMX loads miss out)
    • BROADCAST*, GATHER*, and PREFETCH* also benefit
  • Decoupled double FADD units (see the code sketch after this list)
    • every single- and double-precision SIMD VADD/VSUB (and AVX VADDSUB* and VHADD*/VHSUB*) sees a latency improvement
    • paired with another ADD/SUB: 4->2 clks
    • paired with a MUL: 4->3 clks
    • AVX-512 support: 512b ADD/SUB reciprocal throughput of 0.5, as on the server parts!
    • exception: half-precision ADD/SUB is handled by the FMA units
    • exception: x87 FADD remains at 3 clks
  • Some forms of GPR (general purpose register) immediate addition are treated as NOPs (removed at the "allocate/rename/move elimination/zeroing idioms" stage)
    • LEA r64, [r64+imm8]
    • ADD r64, imm8
    • ADD r64, imm32
    • INC r64
    • Is this limited to 64b GPR additions?
  • eliminated instructions:
    • MOV r32/r64
    • (V)MOV(A/U)(PS/PD/DQ) xmm, ymm
    • 0-5 0x66 NOP
    • LNOP3-7
    • CLC/STC
  • zeroing idioms:
    • (V)XORPS/PD, (V)PXOR xmm, ymm
    • (V)PSUB(U)B/W/D/Q xmm
    • (V)PCMPGTB/W/D/Q xmm
    • (V)PXOR xmm
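
To show why the FADD latency change matters (a sketch with illustrative names, not code from the analysis): a naive SIMD sum reduction is one long dependent chain of VADDPS, so it runs at ADD latency, and the 4->2 clock improvement roughly doubles it; splitting the sum across independent accumulators pushes the bottleneck toward the adders' throughput instead.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch: float sum with two independent accumulators to hide VADDPS latency.
 * On Golden Cove the add-to-add forwarding latency drops from 4 to 2 clocks,
 * so even the single-accumulator version speeds up; extra accumulators help
 * approach the full throughput of the two FADD units. */
static float sum_f32(const float *v, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(v + i));      /* chain 0 */
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(v + i + 8));  /* chain 1 */
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, _mm256_add_ps(acc0, acc1));
    float s = 0.0f;
    for (int k = 0; k < 8; k++) s += lanes[k];
    for (; i < n; i++) s += v[i];                                /* scalar tail */
    return s;
}
```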

Faster GPR instructions (vs Cypress Cove):

  • LOCK latency 20->18 clks
  • LEA with scale throughput 2->3/clk
  • (I)MUL r8 latency 4->3 clks
  • LAHF latency 3->1 clks
  • CMPS* latency 5->4 clks
  • REP CMPSB 1->3.7 Bytes/clock
  • REP SCASB 0.5->1.85 Bytes/clock
  • REP MOVS* 115->122 Bytes/clock (see the sketch after this list)
  • CMPXCHG16B 20|20 -> 16|14
  • PREFETCH* throughput 1->3/clk
  • ANDN/BLSI/BLSMSK/BLSR throughput 2->3/clock
  • SHA1RNDS4 latency 6->4
  • SHA1MSG2 throughput 0.2->0.25/clock
  • SHA256MSG2 11|5->6|2
  • ADC/SBB (r/e)ax 2|2 -> 1|1
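
The REP MOVS figure above refers to the fast REP MOVSB path that memcpy implementations and compilers already lean on for suitable sizes; a minimal sketch of how it is invoked (GCC/Clang x86-64 inline assembly, helper name ours), just to make the Bytes/clock numbers concrete:

```c
#include <stddef.h>

/* Sketch: bulk copy via REP MOVSB. dst goes in RDI, src in RSI, count in RCX. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    asm volatile("rep movsb"
                 : "+D"(dst), "+S"(src), "+c"(n)
                 :
                 : "memory");
}
```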

Faster SIMD instructions (vs Cypress Cove):

  • *FADD xmm/ymm latency 4->3 clks (after MUL)
  • *FADD xmm/ymm latency 4->2 clks (after ADD)
  • (* here covers (V)(ADD/SUB/ADDSUB/HADD/HSUB)(PS/PD))
  • VADD/SUB PS/PD zmm 4|1 -> 3.3|0.5
  • CLMUL xmm 6|1 -> 3|1 (see the sketch after this list)
  • CLMUL ymm, zmm 8|2 -> 3|1
  • VPGATHERDQ xmm, [xm32], xmm 22|1.67->20|1.5 clks
  • VPGATHERDD ymm, [ym32], ymm throughput 0.2 -> 0.33/clock
  • VPGATHERQQ ymm, [ym64], ymm throughput 0.33 -> 0.50/clock
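
The CLMUL improvement matters for CRC folding and AES-GCM's GHASH, which are built on carry-less multiplication. A minimal sketch of the primitive (our own wrapper; compile with -mpclmul):

```c
#include <wmmintrin.h>   /* PCLMULQDQ intrinsics */

/* Sketch: carry-less multiply of two 64-bit values into a 128-bit product.
 * This is the inner primitive of CRC folding and GHASH, so the latency drop
 * from 6 to 3 clocks listed above shortens those dependent chains. */
static __m128i clmul64(unsigned long long a, unsigned long long b)
{
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    return _mm_clmulepi64_si128(va, vb, 0x00);   /* low 64-bit lane x low 64-bit lane */
}
```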

Regressions, Slower instructions (vs Cypress Cove):

  • Store-to-Load-Forward 128b 5->7, 256b 6->7 clocks
  • PAUSE latency 140->160 clocks
  • LEA with scale latency 2->3 clocks
  • (I)DIV r8 latency 15->17 clocks
  • FXCH throughput 2->1/clock
  • LFENCE latency 6->12 clocks
  • VBLENDV(B/PS/PD) xmm, ymm 2->3 clocks
  • (V)AESKEYGEN latency 12->13 clocks
  • VCVTPS2PH/PH2PS latency 5->6 clocks
  • BZHI throughput 2->1/clock
  • VPGATHERDD ymm, [ym32], ymm latency 22->24 clocks
  • VPGATHERQQ ymm, [ym64], ymm latency 21->23 clocks

 

E-core: Gracemont vs Tremont

Microarchitecture Changes:

  • Dual 128b store port (works with every GPR, PUSH, MMX, SSE, AVX, non-temporal m32, m64, m128)
  • Zen2-like memory renaming with GPRs
  • New zeroing idioms
    • SUB r32, r32
    • SUB r64, r64
    • CDQ, CQO
    • (V)PSUBB/W/D/Q/SB/SW/USB/USW
    • (V)PCMPGTB/W/D/Q
  • New ones idiom: (V)PCMPEQB/W/D/Q (see the sketch after this list)
  • MOV elimination: MOV; MOVZX; MOVSX r32, r64
  • NOP elimination: NOP, 1-4 0x66 NOP throughput 3->5/clock, LNOP 3, LNOP 4, LNOP 5
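
The "ones idiom" is a compare of a register with itself, which always yields all-set bits; recognizing it means the core does not treat the instruction as dependent on the register's previous value. A minimal sketch (helper name ours):

```c
#include <emmintrin.h>

/* Sketch: compilers typically materialize an all-ones mask as
 *   pcmpeqd xmm, xmm
 * i.e. the (V)PCMPEQ idiom listed above, which Gracemont now treats as
 * dependency-breaking. */
static __m128i all_ones(void)
{
    return _mm_set1_epi32(-1);   /* usually lowers to a PCMPEQD of a register with itself */
}
```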

Faster GPR instructions (vs Tremont):

  • PAUSE latency 158->62 clocks
  • MOVSX; SHL/R r, 1; SHL/R r, imm8 reciprocal throughput 1->0.25
  • ADD; SUB; CMP; AND; OR; XOR; NEG; NOT; TEST; MOVZX; BSWAP; LEA [r+r]; LEA [r+disp8/32] throughput 3->4 per clock
  • CMOV* throughput 1->2 per clock
  • RCR r, 1 10|10 -> 2|2
  • RCR/RCL r, imm/cl 13|13->11|11
  • SHLD/SHRD r1_32, r1_32, imm8 2|2 -> 2|0.5
  • MOVBE latency 1->0.5 clocks
  • (I)MUL r32 3|1 -> 3|0.5
  • (I)MUL r64 5|2 -> 5|0.5
  • REP STOSB/STOSW/STOSD/STOSQ 15/8/12/11 byte/clock -> 15/15/15/15 bytes/clock

Faster SIMD instructions (vs Tremont):

  • A lot of xmm SIMD throughput is 4/clock instead of the theoretical maximum(?) of 3/clock; we are not sure how this is possible
  • MASKMOVQ throughput 1 per 104 clocks -> 1 per clock
  • PADDB/W/D; PSUBB/W/D PAVGB/PAVGW 1|0.5 -> 1|.33
  • PADDQ/PSUBQ/PCMPEQQ mm, xmm: 2|1 -> 1|.33
  • PShift (x)mm, (x)mm 2|1 -> 1|.33
  • PMUL*, PSADBW mm, xmm 4|1 -> 3|1
  • ADD/SUB/CMP/MAX/MINPS/PD 3|1 -> 3|0.5
  • MULPS/PD 4|1 -> 4|0.5
  • CVT*, ROUND xmm, xmm 4|1 -> 3|1
  • BLENDV* xmm, xmm 3|2 -> 3|0.88
  • AES, GF2P8AFFINEQB, GF2P8AFFINEINVQB xmm 4|1 -> 3|1 (see the sketch after this list)
  • SHA256RNDS2 5|2 -> 4|1
  • PHADD/PHSUB* 6|6 -> 5|5
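
The AES latency gain matters most in serial modes such as CBC encryption, where each block's rounds form one dependent chain of AESENC. A minimal sketch of that chain (key schedule omitted, names ours; compile with -maes):

```c
#include <wmmintrin.h>   /* AES-NI intrinsics */

/* Sketch: AES-128 block encryption -- nine dependent AESENC rounds plus the
 * final AESENCLAST. Each round waits on the previous one, so the xmm AES
 * latency drop from 4 to 3 clocks listed above cuts per-block time in
 * serial modes. */
static __m128i aes128_encrypt_block(__m128i block, const __m128i round_key[11])
{
    block = _mm_xor_si128(block, round_key[0]);          /* initial whitening */
    for (int r = 1; r < 10; r++)
        block = _mm_aesenc_si128(block, round_key[r]);   /* dependent round chain */
    return _mm_aesenclast_si128(block, round_key[10]);
}
```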

Regressions, Slower (vs Tremont):

  • m8, m16 load latency 4->5 clocks
  • ADD/MOVBE load latency 4->5 clocks
  • LOCK ADD 16|16->18|18
  • XCHG mem 17|17->18|18
  • (I)DIV +1 clock
  • DPPS 10|1.5 -> 18|6 (see the sketch after this list)
  • DPPD 6|1 -> 10|3.5
  • FSIN/FCOS +12% slower
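
Given the DPPS/DPPD regression, code targeting the E-cores may prefer to build dot products from MULPS plus horizontal adds instead. A sketch of both variants (our own illustrative helpers; worth benchmarking on the actual part before committing either way):

```c
#include <smmintrin.h>   /* SSE4.1 _mm_dp_ps; also pulls in SSE3 _mm_hadd_ps */

/* Sketch: 4-element dot product two ways. With DPPS going from 10|1.5 to 18|6
 * on Gracemont (table above), the MULPS + HADDPS form may now be the cheaper
 * option on E-cores. */
static float dot4_dpps(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));   /* multiply all lanes, sum into lane 0 */
}

static float dot4_mul_hadd(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);
    p = _mm_hadd_ps(p, p);                         /* pairwise sums */
    p = _mm_hadd_ps(p, p);                         /* full sum in lane 0 */
    return _mm_cvtss_f32(p);
}
```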

 


474 Comments


  • Kvaern1 - Sunday, November 7, 2021 - link

    Because there are no games which are 'incompatible' with ADL.
  • eastcoast_pete - Sunday, November 7, 2021 - link

    While AL is an interesting CPU (regardless of what one's preference is), I still think the star of AL is the Gracemont core (E cores), and I did some very simple-minded, back-of-a-napkin calculations. The top AL has 8 P cores with multithreading = 16 threads, plus 8 E core threads (no multithreading here), for a total of 24 threads. According to first die shots, one P core requires the same die area as 4 E cores. That leaves me wanting an all-E-core CPU with the same die size as the i9 AL, because that could fit 8x4 = 32 E cores in place of the P cores, plus the existing 8 Gracemonts, for a total of 40. And the old problem of "Atoms can't do AVX and AVX2" is solved - because now they can! Yes, single-thread performance would be significantly lower, but any workload that can take advantage of many threads should be at least as fast as on the i9. Does anyone here know if Intel is considering that? It wouldn't be the choice for gaming, but for productivity, it might give both the i9 and, possibly, the 5950X a run for the money.
  • mode_13h - Monday, November 8, 2021 - link

    They currently make Atom-branded embedded server CPUs with up to 24 cores. This one launched last year, using Tremont cores:

    https://ark.intel.com/content/www/us/en/ark/produc...

    I think you can expect to see a Gracemont-based refresh, possibly with some new product lines expanding into non-embedded markets.
  • eastcoast_pete - Monday, November 8, 2021 - link

    Yes, those Tremont-based CPUs are intended/sold for 5G cell stations; I hope that Intel doesn't just refresh those with Gracemont, but makes a 32-40 Gracemont core CPU available for workstations and servers. The one thing that might prevent that is fear (Intel's) of cannibalizing their Sapphire Rapids sales. However, if I were in their shoes, I'd worry more about upcoming AMD and multi-core ARM server chips, and sell all the CPUs they can.
  • mode_13h - Tuesday, November 9, 2021 - link

    Well, it's a start that Intel is already using these cores in *some* kind of server CPU, no? That suggests they already should have some server-grade RAS features built-in. So, it should be a fairly small step to use them in a high core count CPU to counter the Gravitons and Altras. I think they will, since it should be more competitive in terms of perf/W.

    As for workstations, I think you'll need to find a workstation board with a server CPU socket. I doubt they'll be pushing massive E-core -only CPUs specifically for workstations, since workstation users also tend to care about single-thread performance.
  • anemusek - Sunday, November 7, 2021 - link

    Sorry, but performance isn't everything; +/- a few percent in the real world will not restore confidence. Critical flaws, disabled functionality (DX12 on Haswell, for example), unstable instruction features, etc.
    I cannot afford to trust such a company.
  • Dolda2000 - Sunday, November 7, 2021 - link

    I just wanted to add a big Kudos for this article. AnandTech's coverage of the 12900K was by a wide margin the best of any I read or watched, with regards to coverage of the various variables involved, and with the breadth and depth of testing. Thanks for keeping it up!
  • chantzeleong - Monday, November 8, 2021 - link

    I run Power bi and tensorflow with large dataset. Which Intel CPU do you recommend and why?
  • mode_13h - Tuesday, November 9, 2021 - link

    I don't know about "Power bi", but Tensorflow should run best on GPUs. Which CPU to get then depends on how many GPUs you're going to use. If >= 3, then Threadripper. Otherwise, go for Alder Lake or Ryzen 5000 series.

    You'll probably find the best advice among user communities for those specific apps.
  • velanapontinha - Monday, November 8, 2021 - link

    We've seen this before. It is time to short AMD, unfortunately.
