The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity

by Dr. Ian Cutress & Andrei Frumusanu on November 4, 2021 9:00 AM EST

Instruction Changes

Both of the processor cores inside Alder Lake are brand new – they build on the previous generation Core and Atom designs in multiple ways. As always, Intel gives us a high level overview of the microarchitecture changes, as we’ve written in an article from Architecture Day:

At the highest level, the P-core supports a 6-wide decode (up from 4), and has split the execution ports to allow for more operations to execute at once, enabling higher IPC and ILP from workloads that can take advantage. Usually a wider decode consumes a lot more power, but Intel says that its micro-op cache (now 4K) and front-end are improved enough that the decode engine spends 80% of its time power gated.

The E-core similarly has a 6-wide decode, although split into 2x3-wide. It has 17 execution ports, buffered by double the load/store support of the previous generation Atom core. Beyond this, Gracemont is the first Atom core to support AVX2 instructions.

As part of our analysis into new microarchitectures, we also do an instruction sweep to see what other benefits have been added. The following is literally a raw list of changes, which we are still in the process of going through. Please forgive the raw data. Big thanks to our industry friends who help with this analysis.

For any of the following entries listed as A|B, A is the latency (in clocks) and B is the reciprocal throughput (clocks per instruction).
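
To make the A|B notation concrete, here is a minimal Python sketch that splits an entry into its two numbers and converts the reciprocal throughput into instructions per clock. The function names are our own illustration, not part of any tool used for the sweep, and the sample string is the SHA256MSG2 entry from the P-core list.

```python
def parse_entry(entry):
    """Split 'A|B' into latency (clocks) and reciprocal throughput (clocks per instruction)."""
    latency_text, recip_text = entry.split("|")
    latency = float(latency_text)
    recip = float(recip_text)
    return {
        "latency": latency,
        "reciprocal_throughput": recip,
        # Sustained rate when the instructions do not depend on each other.
        "per_clock": 1.0 / recip,
    }


def parse_change(change):
    """Split 'A|B -> C|D' into (before, after) dictionaries."""
    before_text, after_text = change.split("->")
    return parse_entry(before_text.strip()), parse_entry(after_text.strip())


before, after = parse_change("11|5->6|2")
print(before["latency"], "->", after["latency"], "clocks of latency")
print(before["per_clock"], "->", after["per_clock"], "instructions per clock")
```

Latency matters for chains of dependent instructions, while reciprocal throughput matters when many independent instructions are in flight, which is why both numbers are reported.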

P-core: Golden Cove vs Cypress Cove

Microarchitecture Changes:

6-wide decoder with a 32-byte window: code size matters much less, e.g. 3 MOV imm64 per clock (the last similar 50% jump was Pentium -> Pentium Pro in 1995; Conroe in 2006 was just a 3->4 jump)
Triple load: (almost) universal
- every GPR, SSE, VEX, EVEX load gains (only MMX load unsupported)
- BROADCAST*, GATHER*, PREFETCH* also gain
Decoupled double FADD units
- every single and double SIMD VADD/VSUB (and AVX VADDSUB* and VHADD*/VHSUB*) has latency gains
- Another ADD/SUB means 4->2 clks
- Another MUL means 4->3 clks
- AVX512 support: 512b ADD/SUB rec. throughput 0.5, as in server!
- exception: half precision ADD/SUB handled by FMAs
- exception: x87 FADD remained 3 clks
Some forms of GPR (general purpose register) immediate additions are treated as NOPs (removed at the "allocate/rename/move elimination/zeroing idioms" step)
- LEA r64, [r64+imm8]
- ADD r64, imm8
- ADD r64, imm32
- INC r64
- Is this just for 64b addition GPRs?
eliminated instructions:
- MOV r32/r64
- (V)MOV(A/U)(PS/PD/DQ) xmm, ymm
- 0-5 0x66 NOP
- LNOP3-7
- CLC/STC
zeroing idioms:
- (V)XORPS/PD, (V)PXOR xmm, ymm
- (V)PSUB(U)B/W/D/Q xmm
- (V)PCMPGTB/W/D/Q xmm
- (V)PXOR xmm
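
The decoder item at the top of this list can be checked with some quick arithmetic. The sketch below is a simplified model under stated assumptions: it reads the "32b window" as 32 bytes, assumes the previous generation fed its 4-wide decoder from a 16-byte window, counts the Pentium to Pentium Pro step as 2 to 3 decoders, and ignores alignment and the micro-op cache. MOV r64, imm64 encodes in 10 bytes. The function names are hypothetical.

```python
MOV_IMM64_BYTES = 10  # REX.W prefix (1) + opcode (1) + 64-bit immediate (8)


def decoded_per_clock(decode_width, window_bytes, instruction_bytes):
    """Upper bound on same-sized instructions decoded per clock in this simplified model."""
    return min(decode_width, window_bytes // instruction_bytes)


def jump_percent(old_width, new_width):
    """Relative increase in decode width, in percent."""
    return 100.0 * (new_width - old_width) / old_width


# Assumed previous-generation front end: 4-wide decode, 16-byte window.
print("MOV imm64 per clock, 16-byte window:", decoded_per_clock(4, 16, MOV_IMM64_BYTES))
# Golden Cove: 6-wide decode, window read as 32 bytes.
print("MOV imm64 per clock, 32-byte window:", decoded_per_clock(6, 32, MOV_IMM64_BYTES))

# Decode width jumps mentioned in the list.
print("Pentium -> Pentium Pro (2 -> 3):", jump_percent(2, 3))
print("Conroe (3 -> 4):", jump_percent(3, 4))
print("Golden Cove (4 -> 6):", jump_percent(4, 6))
```

In this model the window, not the decode width, is the limit for long instructions: three 10-byte instructions fit in 32 bytes, but only one fits in 16.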

Faster GPR instructions (vs Cypress Cove):

LOCK latency 20->18 clks
LEA with scale throughput 2->3/clk
(I)MUL r8 latency 4->3 clks
LAHF latency 3->1 clks
CMPS* latency 5->4 clks
REP CMPSB 1->3.7 Bytes/clock
REP SCASB 0.5->1.85 Bytes/clock
REP MOVS* 115->122 Bytes/clock
CMPXCHG16B 20|20 -> 16|14
PREFETCH* throughput 1->3/clk
ANDN/BLSI/BLSMSK/BLSR throughput 2->3/clock
SHA1RNDS4 latency 6->4
SHA1MSG2 throughput 0.2->0.25/clock
SHA256MSG2 11|5->6|2
ADC/SBB (r/e)ax 2|2 -> 1|1

Faster SIMD instructions (vs Cypress Cove):

*FADD xmm/ymm latency 4->3 clks (after MUL)
*FADD xmm/ymm latency 4->2 clks (after ADD)
* means (V)(ADD/SUB/ADDSUB/HADD/HSUB)(PS/PD) affected
VADD/SUB/PS/PD zmm 4|1->3.3|0.5
CLMUL xmm 6|1->3|1
CLMUL ymm, zmm 8|2->3|1
VPGATHERDQ xmm, [xm32], xmm 22|1.67->20|1.5 clks
VPGATHERDD ymm, [ym32], ymm throughput 0.2 -> 0.33/clock
VPGATHERQQ ymm, [ym64], ymm throughput 0.33 -> 0.50/clock
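
The FADD latency entries matter most for code where each addition waits on the previous one, such as summing an array into a single accumulator. As a rough sketch, assuming a purely latency-bound chain with no other bottleneck (the chain length and function name are hypothetical):

```python
def chain_clocks(n_adds, latency_clocks):
    """Clocks for n_adds additions where each one waits for the previous result."""
    return n_adds * latency_clocks


N = 1000  # hypothetical length of a single-accumulator sum

cypress_cove = chain_clocks(N, 4)  # 4 clocks per dependent ADD
golden_cove = chain_clocks(N, 2)   # 2 clocks when an ADD follows an ADD

print("Cypress Cove:", cypress_cove, "clocks")
print("Golden Cove:", golden_cove, "clocks")
print("Ratio:", cypress_cove / golden_cove)
```

Code that already splits the sum across several independent accumulators is bound by throughput rather than latency, so it would see less of this gain.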

Regressions, Slower instructions (vs Cypress Cove):

Store-to-Load-Forward 128b 5->7, 256b 6->7 clocks
PAUSE latency 140->160 clocks
LEA with scale latency 2->3 clocks
(I)DIV r8 latency 15->17 clocks
FXCH throughput 2->1/clock
LFENCE latency 6->12 clocks
VBLENDV(B/PS/PD) xmm, ymm 2->3 clocks
(V)AESKEYGEN latency 12->13 clocks
VCVTPS2PH/PH2PS latency 5->6 clocks
BZHI throughput 2->1/clock
VPGATHERDD ymm, [ym32], ymm latency 22->24 clocks
VPGATHERQQ ymm, [ym64], ymm latency 21->23 clocks

E-core: Gracemont vs Tremont

Microarchitecture Changes:

Dual 128b store port (works with every GPR, PUSH, MMX, SSE, AVX, non-temporal m32, m64, m128)
Zen2-like memory renaming with GPRs
New zeroing idioms
- SUB r32, r32
- SUB r64, r64
- CDQ, CQO
- (V)PSUBB/W/D/Q/SB/SW/USB/USW
- (V)PCMPGTB/W/D/Q
New ones idiom: (V)PCMPEQB/W/D/Q
MOV elimination: MOV; MOVZX; MOVSX r32, r64
NOP elimination: NOP, 1-4 0x66 NOP throughput 3->5/clock, LNOP 3, LNOP 4, LNOP 5

Faster GPR instructions (vs Tremont)

PAUSE latency 158->62 clocks
MOVSX; SHL/R r, 1; SHL/R r,imm8 tp 1->0.25
ADD; SUB; CMP; AND; OR; XOR; NEG; NOT; TEST; MOVZX; BSWAP; LEA [r+r]; LEA [r+disp8/32] throughput 3->4 per clock
CMOV* throughput 1->2 per clock
RCR r, 1 10|10 -> 2|2
RCR/RCL r, imm/cl 13|13->11|11
SHLD/SHRD r1_32, r1_32, imm8 2|2 -> 2|0.5
MOVBE latency 1->0.5 clocks
(I)MUL r32 3|1 -> 3|0.5
(I)MUL r64 5|2 -> 5|0.5
REP STOSB/STOSW/STOSD/STOSQ 15/8/12/11 bytes/clock -> 15/15/15/15 bytes/clock
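
To put the REP STOS numbers in perspective, the following sketch estimates how many clocks each variant needs to fill a buffer at the listed rates. It assumes steady-state throughput with no startup overhead, and the 4096-byte buffer is a hypothetical size chosen for illustration.

```python
# Bytes per clock taken from the REP STOS entry in the list.
TREMONT = {"STOSB": 15, "STOSW": 8, "STOSD": 12, "STOSQ": 11}
GRACEMONT = {"STOSB": 15, "STOSW": 15, "STOSD": 15, "STOSQ": 15}


def clocks_to_fill(buffer_bytes, bytes_per_clock):
    """Steady-state clocks needed to store buffer_bytes at the given rate."""
    return buffer_bytes / bytes_per_clock


BUFFER_BYTES = 4096  # hypothetical buffer size

for name in TREMONT:
    old = clocks_to_fill(BUFFER_BYTES, TREMONT[name])
    new = clocks_to_fill(BUFFER_BYTES, GRACEMONT[name])
    print(f"REP {name}: {old:.0f} -> {new:.0f} clocks ({old / new:.2f}x)")
```

The byte variant is unchanged, while the word variant gains the most because it was the slowest on Tremont.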

Faster SIMD instructions (vs Tremont)

A lot of xmm SIMD throughput is 4/clock instead of theoretical maximum(?) of 3/clock, not sure how this is possible
MASKMOVQ throughput 1 per 104 clocks -> 1 per clock
PADDB/W/D; PSUBB/W/D; PAVGB/PAVGW 1|0.5 -> 1|.33
PADDQ/PSUBQ/PCMPEQQ mm, xmm: 2|1 -> 1|.33
PShift (x)mm, (x)mm 2|1 -> 1|.33
PMUL*, PSADBW mm, xmm 4|1 -> 3|1
ADD/SUB/CMP/MAX/MINPS/PD 3|1 -> 3|0.5
MULPS/PD 4|1 -> 4|0.5
CVT*, ROUND xmm, xmm 4|1 -> 3|1
BLENDV* xmm, xmm 3|2 -> 3|0.88
AES, GF2P8AFFINEQB, GF2P8AFFINEINVQB xmm 4|1 -> 3|1
SHA256RNDS2 5|2 -> 4|1
PHADD/PHSUB* 6|6 -> 5|5

Regressions, Slower (vs Tremont):

m8, m16 load latency 4->5 clocks
ADD/MOVBE load latency 4->5 clocks
LOCK ADD 16|16->18|18
XCHG mem 17|17->18|18
(I)DIV +1 clock
DPPS 10|1.5 -> 18|6
DPPD 6|1 -> 10|3.5
FSIN/FCOS 12% slower

474 Comments

xhris4747 - Tuesday, November 9, 2021 - link
They should use pbo it's fair to
xhris4747 - Tuesday, November 9, 2021 - link
Is you using pbo some people are t using pbo which I think isn't fair because that i9 is oc to snot
EnglishMike - Thursday, November 4, 2021 - link
It's not just the gaming world -- it's the entire world except for long-running CPU intensive tasks. Handbrake and blender are valuable benchmarking tools for seeing what a CPU is capable of when pushed to the limit, but the vast majority of users -- even most power users -- don't do that.

Sure, Intel has more work to do to improve power efficiency in long running CPU intensive workloads, but taking the worst case power usage scenarios distorts the picture as much as you're claiming the reviewers are doing.
Wrs - Thursday, November 4, 2021 - link
Can't calculate efficiency without scores. Also, well known that power scales much faster than performance. The proper way to compare efficiency is really at constant work rate or constant power.
blanarahul - Thursday, November 4, 2021 - link
Sorry sir I can't. You haven't provided me the data for how much time each test took! Would you be so kind as to do that?
Netmsm - Thursday, November 4, 2021 - link
Sorry, this is a direct link to Tom's bench:
https://cdn.mos.cms.futurecdn.net/if3Lox9ZJBRxjbhr...
this is for "blender bmw27" in which both 12900k and 5950x finish the job around 80 seconds BUT 12900k sucks power for about 70 percent more than 5950x.

you can find other benches here:
https://www.tomshardware.com/news/intel-core-i9-12...

I'm wondering why Ian hasn't put 12900k nominal TDP in results just like all other CPU's! When 10900k was released with nominal TDP of 125, Ian put than number in every bench while in reality 10900k was consuming up to 254 (according to the Ian's review)! When I asked him to put real numbers of power consumption for every test he said I can't because of time and because I've too much to do and because I've no money to pay and delegate such works to an assistant!
But now we have 12900k with nominal TDP of 241 which seems unpleasant to Ian to put it in front of it in results.
Zingam - Friday, November 5, 2021 - link
Last gen game. How about glquake?

1 billion computing devices and just a few million game units sold? What does it mean? Gamers are a tiny but vocal minority.
If they bring this performance at 5W on low and 45W on high then its good for majority of people. This is just a space heater.
Gothmoth - Friday, November 5, 2021 - link
so throwing more cores on a game that can´t make use of them is usless thanks for clarifing that.... genius!!

when a 5600x is producing 144 FPS and a 5950x is producing 150 FPS the 5600x is the clear winner when it comes to efficency.

now try to cool the 12900K in a work environment with an air cooler.
i can cool my threadripper with a noctua aircooler and let it run under full load for ours.

i am really curious to see how the 12900k will handle that.

i am not an amd fanboy. i was using anti-consumer intel for a decade before switching to ryzen.
i would us intel again when it makes sense for me (i need my pc for work not gaming).

but with this power draw it does not make sense.
Wrs - Saturday, November 6, 2021 - link
The 12900k is fine with a Noctua D15 in a work environment. Doesn't matter if you're hammering it at 95C the whole time, the D15 doesn't get louder. But it's no megachip like a Threadripper. For that on the Intel side you'd wait for Sapphire Rapids or put up with an existing Xeon Gold with 8-32 Ice Lake cores at 10nm.
Netmsm - Saturday, November 6, 2021 - link
How would it be justified to buy Xeon Gold in place of Threadripper and Epyc?!
