The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity

by Dr. Ian Cutress & Andrei Frumusanu on November 4, 2021 9:00 AM EST

Instruction Changes

Both of the processor cores inside Alder Lake are brand new – they build on the previous generation Core and Atom designs in multiple ways. As always, Intel gives us a high level overview of the microarchitecture changes, as we’ve written in an article from Architecture Day:

At the highest level, the P-core supports a 6-wide decode (up from 4), and has split the execution ports to allow for more operations to execute at once, enabling higher IPC and ILP from workloads that can take advantage. Usually a wider decode consumes a lot more power, but Intel says that its micro-op cache (now 4K) and front-end are improved enough that the decode engine spends 80% of its time power gated.

For the E-core, similarly it also has a 6-wide decode, although split to 2x3-wide. It has 17 execution ports, buffered by double the load/store support of the previous generation Atom core. Beyond this, Gracemont is the first Atom core to support AVX2 instructions.

As part of our analysis into new microarchitectures, we also do an instruction sweep to see what other benefits have been added. The following is literally a raw list of changes, which we are still in the process of going through. Please forgive the raw data. Big thanks to our industry friends who help with this analysis.

Any entry below that is listed as A|B means A is the latency (in clocks) and B is the reciprocal throughput (in clocks per instruction).
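To make the A|B notation concrete, here is a minimal Python sketch that parses entries of the form "A|B -> C|D" as used in the lists below. The names `Timing`, `parse_timing` and `parse_change` are hypothetical helpers written for this illustration, not part of any tool used for the analysis.

```python
import re
from typing import NamedTuple, Tuple


class Timing(NamedTuple):
    latency: float           # A: clocks until the result is available
    recip_throughput: float  # B: clocks per instruction when issued back to back

    @property
    def per_clock(self) -> float:
        """Sustained instructions per clock (inverse of reciprocal throughput)."""
        return 1.0 / self.recip_throughput


def parse_timing(text: str) -> Timing:
    """Parse a single 'A|B' pair such as '16|14' or '1|.33'."""
    latency, recip = text.strip().split("|")
    return Timing(float(latency), float(recip))


def parse_change(entry: str) -> Tuple[Timing, Timing]:
    """Parse 'A|B -> C|D' (spaces around the arrow are optional) into (old, new)."""
    old, new = re.split(r"\s*->\s*", entry.strip())
    return parse_timing(old), parse_timing(new)


old, new = parse_change("20|20 -> 16|14")
print(f"latency: {old.latency:g} -> {new.latency:g} clocks")
print(f"throughput: {old.per_clock:.3f} -> {new.per_clock:.3f} instructions/clock")
```

A lower B is better: a reciprocal throughput of 0.5 corresponds to two instructions per clock.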

P-core: Golden Cove vs Cypress Cove

Microarchitecture Changes:

6-wide decoder with 32-byte window: code size matters much less, e.g. 3 MOV imm64 per clock (the last similar 50% jump was Pentium -> Pentium Pro in 1995; Conroe in 2006 was just a 3->4 jump)
Triple load: (almost) universal
- every GPR, SSE, VEX, EVEX load gains (only MMX load unsupported)
- BROADCAST*, GATHER*, PREFETCH* also gain
Decoupled double FADD units
- every single and double SIMD VADD/VSUB (and AVX VADDSUB* and VHADD*/VHSUB*) has latency gains
- Another ADD/SUB means 4->2 clks
- Another MUL means 4->3 clks
- AVX512 support: 512b ADD/SUB rec. throughput 0.5, as in server!
- exception: half precision ADD/SUB handled by FMAs
- exception: x87 FADD remained 3 clks
Some forms of GPR (general purpose register) immediate addition are treated as NOPs (removed at the "allocate/rename/move elimination/zeroing idioms" step)
- LEA r64, [r64+imm8]
- ADD r64, imm8
- ADD r64, imm32
- INC r64
- Is this just for 64b addition GPRs?
Eliminated instructions:
- MOV r32/r64
- (V)MOV(A/U)(PS/PD/DQ) xmm, ymm
- 0-5 0x66 NOP
- LNOP3-7
- CLC/STC
Zeroing idioms:
- (V)XORPS/PD, (V)PXOR xmm, ymm
- (V)PSUB(U)B/W/D/Q xmm
- (V)PCMPGTB/W/D/Q xmm
- (V)PXOR xmm
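The "3 MOV imm64 per clock" figure in the decoder item at the top of this list can be checked with simple arithmetic. The sketch below reads the decode window as 32 bytes and uses the 10-byte encoding of MOV r64, imm64. The 16-byte window used for the comparison case is an assumption about the previous generation, and the function `max_decoded_per_clock` is a hypothetical helper. It ignores alignment, the micro-op cache and every other front-end effect, so treat it as an upper bound only.

```python
def max_decoded_per_clock(instr_bytes: int, window_bytes: int, decode_width: int) -> int:
    """Upper bound on same-sized instructions decoded per clock.

    Limited either by how many whole instructions fit in the window
    or by the number of decoders, whichever is smaller.
    """
    if instr_bytes <= 0:
        raise ValueError("instruction length must be positive")
    return min(decode_width, window_bytes // instr_bytes)


# MOV r64, imm64 encodes as REX.W prefix + opcode + 8-byte immediate = 10 bytes.
MOV_IMM64_BYTES = 10

# 6-wide decode with a 32-byte window (figures from the list above).
wide = max_decoded_per_clock(MOV_IMM64_BYTES, window_bytes=32, decode_width=6)

# Comparison case: 4-wide decode with an ASSUMED 16-byte window.
narrow = max_decoded_per_clock(MOV_IMM64_BYTES, window_bytes=16, decode_width=4)

print(f"32-byte window, 6-wide: {wide} MOV imm64 per clock")
print(f"16-byte window, 4-wide: {narrow} MOV imm64 per clock")
```

With long instructions the window, not the decoder count, is the limit, which is why a wider window makes code size matter less.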

Faster GPR instructions (vs Cypress Cove):

LOCK latency 20->18 clks
LEA with scale throughput 2->3/clk
(I)MUL r8 latency 4->3 clks
LAHF latency 3->1 clks
CMPS* latency 5->4 clks
REP CMPSB 1->3.7 Bytes/clock
REP SCASB 0.5->1.85 Bytes/clock
REP MOVS* 115->122 Bytes/clock
CMPXCHG16B 20|20 -> 16|14
PREFETCH* throughput 1->3/clk
ANDN/BLSI/BLSMSK/BLSR throughput 2->3/clock
SHA1RNDS4 latency 6->4
SHA1MSG2 throughput 0.2->0.25/clock
SHA256MSG2 11|5->6|2
ADC/SBB (r/e)ax 2|2 -> 1|1
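As a rough way to read the bytes/clock entries in the list above, the following Python sketch converts the before/after figures for the REP string instructions into speedup factors. Only the numbers come from the list. The dictionary and the `speedup` helper are illustrative names.

```python
# (old, new) throughput in bytes per clock, taken from the list above.
rep_string_ops = {
    "REP CMPSB": (1.0, 3.7),
    "REP SCASB": (0.5, 1.85),
    "REP MOVS*": (115.0, 122.0),
}


def speedup(old_bytes_per_clock: float, new_bytes_per_clock: float) -> float:
    """Ratio of new to old throughput; values above 1.0 mean faster."""
    return new_bytes_per_clock / old_bytes_per_clock


for name, (old, new) in rep_string_ops.items():
    print(f"{name}: {old:g} -> {new:g} bytes/clock ({speedup(old, new):.2f}x)")
```

The compare and scan variants gain the same factor of 3.7, while REP MOVS* improves by about 6%.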

Faster SIMD instructions (vs Cypress Cove):

*FADD xmm/ymm latency 4->3 clks (after MUL)
*FADD xmm/ymm latency 4->2 clks (after ADD)
* means (V)(ADD/SUB/ADDSUB/HADD/HSUB)(PS/PD) affected
VADD/SUB/PS/PD zmm 4|1->3.3|0.5
CLMUL xmm 6|1->3|1
CLMUL ymm, zmm 8|2->3|1
VPGATHERDQ xmm, [xm32], xmm 22|1.67->20|1.5 clks
VPGATHERDD ymm, [ym32], ymm throughput 0.2 -> 0.33/clock
VPGATHERQQ ymm, [ym64], ymm throughput 0.33 -> 0.50/clock
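The context-dependent FADD latency above (2 clocks after an ADD, 3 clocks after a MUL, versus a flat 4 clocks before) matters most for dependent chains. The Python sketch below totals the latency of such a chain under a deliberately simple model. The multiply latency of 4 clocks is an assumption that does not come from this list, and an add at the start of the chain is assumed to take the 3-clock figure. The function names are hypothetical.

```python
from typing import List

ADD_AFTER_ADD = 2      # Golden Cove: ADD/SUB following another ADD/SUB
ADD_AFTER_OTHER = 3    # Golden Cove: ADD/SUB following a MUL (also assumed for the first op)
OLD_ADD_LATENCY = 4    # Cypress Cove: flat ADD/SUB latency
MUL_LATENCY = 4        # ASSUMPTION: not taken from the list above


def chain_latency_new(ops: List[str]) -> int:
    """Total clocks for a fully dependent chain of 'add'/'mul' ops (new core)."""
    total, prev = 0, None
    for op in ops:
        if op == "mul":
            total += MUL_LATENCY
        elif op == "add":
            total += ADD_AFTER_ADD if prev == "add" else ADD_AFTER_OTHER
        else:
            raise ValueError(f"unknown op: {op}")
        prev = op
    return total


def chain_latency_old(ops: List[str]) -> int:
    """Same chain with the flat 4-clock add latency (old core)."""
    for op in ops:
        if op not in ("add", "mul"):
            raise ValueError(f"unknown op: {op}")
    return sum(MUL_LATENCY if op == "mul" else OLD_ADD_LATENCY for op in ops)


chain = ["mul", "add", "add", "add"]
print(f"old: {chain_latency_old(chain)} clocks, new: {chain_latency_new(chain)} clocks")
```

Under these assumptions, a long reduction made only of dependent adds approaches half the old latency.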

Regressions, Slower instructions (vs Cypress Cove):

Store-to-Load-Forward 128b 5->7, 256b 6->7 clocks
PAUSE latency 140->160 clocks
LEA with scale latency 2->3 clocks
(I)DIV r8 latency 15->17 clocks
FXCH throughput 2->1/clock
LFENCE latency 6->12 clocks
VBLENDV(B/PS/PD) xmm, ymm 2->3 clocks
(V)AESKEYGEN latency 12->13 clocks
VCVTPS2PH/PH2PS latency 5->6 clocks
BZHI throughput 2->1/clock
VPGATHERDD ymm, [ym32], ymm latency 22->24 clocks
VPGATHERQQ ymm, [ym64], ymm latency 21->23 clocks

E-core: Gracemont vs Tremont

Microarchitecture Changes:

Dual 128b store port (works with every GPR, PUSH, MMX, SSE, AVX, non-temporal m32, m64, m128)
Zen2-like memory renaming with GPRs
New zeroing idioms
- SUB r32, r32
- SUB r64, r64
- CDQ, CQO
- (V)PSUBB/W/D/Q/SB/SW/USB/USW
- (V)PCMPGTB/W/D/Q
New ones idiom: (V)PCMPEQB/W/D/Q
MOV elimination: MOV; MOVZX; MOVSX r32, r64
NOP elimination: NOP, 1-4 0x66 NOP throughput 3->5/clock, LNOP 3, LNOP 4, LNOP 5
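The zeroing and ones idioms above work because the result does not depend on the register's contents when both operands are the same register, so the core can resolve them without waiting on the old value. A minimal Python sketch of that property on a single 32-bit lane follows. It is a toy model of the arithmetic only, not of the rename hardware, and the function names are made up for the illustration.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit lane


def sub_same(x: int) -> int:
    """SUB r, r / PSUB* x, x: always zero."""
    return (x - x) & MASK32


def xor_same(x: int) -> int:
    """XOR r, r / PXOR x, x: always zero."""
    return (x ^ x) & MASK32


def pcmpgt_same(x: int) -> int:
    """PCMPGT* x, x: 'x > x' is never true, so the mask is all zeros."""
    return MASK32 if x > x else 0


def pcmpeq_same(x: int) -> int:
    """PCMPEQ* x, x: 'x == x' is always true for integers, so the mask is all ones."""
    return MASK32 if x == x else 0


samples = [0, 1, 0x7FFFFFFF, 0x80000000, MASK32]
assert all(sub_same(v) == 0 for v in samples)
assert all(xor_same(v) == 0 for v in samples)
assert all(pcmpgt_same(v) == 0 for v in samples)
assert all(pcmpeq_same(v) == MASK32 for v in samples)
print("all idioms produce input-independent results")
```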

Faster GPR instructions (vs Tremont):

PAUSE latency 158->62 clocks
MOVSX; SHL/R r, 1; SHL/R r, imm8 reciprocal throughput 1->0.25
ADD; SUB; CMP; AND; OR; XOR; NEG; NOT; TEST; MOVZX; BSWAP; LEA [r+r]; LEA [r+disp8/32] throughput 3->4 per clock
CMOV* throughput 1->2 per clock
RCR r, 1 10|10 -> 2|2
RCR/RCL r, imm/cl 13|13->11|11
SHLD/SHRD r1_32, r1_32, imm8 2|2 -> 2|0.5
MOVBE latency 1->0.5 clocks
(I)MUL r32 3|1 -> 3|0.5
(I)MUL r64 5|2 -> 5|0.5
REP STOSB/STOSW/STOSD/STOSQ 15/8/12/11 bytes/clock -> 15/15/15/15 bytes/clock

Faster SIMD instructions (vs Tremont):

A lot of xmm SIMD throughput is 4/clock instead of theoretical maximum(?) of 3/clock, not sure how this is possible
MASKMOVQ throughput 1 per 104 clocks -> 1 per clock
PADDB/W/D; PSUBB/W/D; PAVGB/PAVGW 1|0.5 -> 1|0.33
PADDQ/PSUBQ/PCMPEQQ mm, xmm: 2|1 -> 1|0.33
PShift (x)mm, (x)mm 2|1 -> 1|0.33
PMUL*, PSADBW mm, xmm 4|1 -> 3|1
ADD/SUB/CMP/MAX/MINPS/PD 3|1 -> 3|0.5
MULPS/PD 4|1 -> 4|0.5
CVT*, ROUND xmm, xmm 4|1 -> 3|1
BLENDV* xmm, xmm 3|2 -> 3|0.88
AES, GF2P8AFFINEQB, GF2P8AFFINEINVQB xmm 4|1 -> 3|1
SHA256RNDS2 5|2 -> 4|1
PHADD/PHSUB* 6|6 -> 5|5

Regressions, Slower (vs Tremont):

m8, m16 load latency 4->5 clocks
ADD/MOVBE load latency 4->5 clocks
LOCK ADD 16|16->18|18
XCHG mem 17|17->18|18
(I)DIV +1 clock
DPPS 10|1.5 -> 18|6
DPPD 6|1 -> 10|3.5
FSIN/FCOS 12% slower

474 Comments

Netmsm - Sunday, November 7, 2021 - link
I believe, we're not talking about ISO-efficiency or manufacturing or engineering details as facts! These are facts but in the appropriate discussion. Here, we have results. These results are produced by all those technological efforts. In fact, those theoretical improvements are getting concluded in these pragmatical information. Therefore, we should NOT wink at performance per watt in RESULTS - not ISO-related matters.

So, the fact, my friend, is that Intel's new architecture does tend to suck 70-80 percent more power and give 50-60 percent more heat. Just by overclocking 100MHz, the 12900k jumps from ~80-85 to 100 degrees centigrade while consuming ~300 watts.

Once in the past, AMD tried to get ahead of Nvidia in performance with the 6990 because they coveted the title of most powerful graphics card. AMD made the hottest and noisiest graphics card in history, and now Intel is mimicking that :))
One can argue that it is natural when you cannot stop or catch a rival so try to do some chicaneries. As it is very clear that Anandtech deliberately does not tend to put even the nominal TDP of Intel 12900k in their benches. I loathe this iniquitous practice!
Wrs - Sunday, November 7, 2021 - link
@Netmsm I believe the mistake is construing performance-per-watt (PPW) of a consumer chip as indicative of PPW for a future server chip based on the same core. Consumer chips are typically optimized for performance-per-area (PPA) because consumers want snappiness and they are afraid of high purchase costs while simultaneously caring much less than datacenters about cost of electricity.
Netmsm - Monday, November 8, 2021 - link
@Wrs You cannot totally separate efficiency of consumer and enterprise chips!
As an incontrovertible fact, architecture is what primarily (not completely) determines the efficacy of a processor.
Is Intel going to kit out upcoming server CPUs in an improved architecture?
Wrs - Monday, November 8, 2021 - link
@Netmsm Architecture, process, and configuration all can heavily impact efficiency/PPW. I’m not aware of any architectural reason that Golden Cove would be much less efficient. It’s a mildly larger core, but it doesn’t have outrageous pipelining or execution imbalances. It derives from a lineage of reasonably efficient cores, and they had to be as they remained on aging 14nm. Processwise Intel 7 isn’t much less efficient than TSMC N7, either. (It could even be more efficient, but analysis hasn’t been precise enough to tell.) But clearly ADL in a 12900/12700k is set up to be inefficient yet performant at high load by virtue of high frequency/voltage scaling and thermal density. I could do almost the same on a dual CCD Ryzen, before running into AM4 socket limits. That’s obviously not how either company approaches server chips.
Netmsm - Tuesday, November 9, 2021 - link
When you cannot infer or appraise or guess we should drop it for now and wait for real tests of upcoming server chips to come.
regards ^_^
GamingRiggz - Tuesday, March 15, 2022 - link
Thankfully you are no engineer.
AbRASiON - Thursday, November 4, 2021 - link
AMD would have less of an issue if the 5000 processors weren’t originally price gouged.

Many people held off switching teams due to that. Instead of the processor being an amazing must buy, it was just a decent purchase. So they waited.

If you’re on the back foot in this game, you should always be competing hard to get that stranglehold and mind share.

I’m glad they’re competing though and hopefully they release some very competitive and REASONABLY PRICED products in the near future.
Fataliity - Thursday, November 4, 2021 - link
Their revenue and marketshare #'s beg to disagree.
Spunjji - Friday, November 5, 2021 - link
They've been selling every CPU they can make. There are shortages of every Zen 3 based notebook out there (to the extent that some OEMs have cancelled certain models) and they're selling so many products based on the desktop chiplets that Threadripper 5000 simply isn't a thing. You ought to factor that into your assessment of how they're doing.
BillBear - Thursday, November 4, 2021 - link
Is anyone gullible enough to forget more than a decade of price gouging, low core counts and nearly nonexistent performance increases we got from Intel, vs. the high core counts, increasing performance, and lower prices we got from AMD?
