Instruction Changes

Both of the processor cores inside Alder Lake are brand new – they build on the previous generation Core and Atom designs in multiple ways. As always, Intel gives us a high level overview of the microarchitecture changes, as we’ve written in an article from Architecture Day:

At the highest level, the P-core supports a 6-wide decode (up from 4), and has split the execution ports to allow more operations to execute at once, enabling higher IPC and ILP for workloads that can take advantage. Usually a wider decode consumes a lot more power, but Intel says that its micro-op cache (now 4K) and front-end are improved enough that the decode engine spends 80% of its time power gated.

Similarly, the E-core also has a 6-wide decode, although split into 2x3-wide. It has 17 execution ports, buffered by double the load/store support of the previous generation Atom core. Beyond this, Gracemont is the first Atom core to support AVX2 instructions.

As part of our analysis into new microarchitectures, we also do an instruction sweep to see what other benefits have been added. The following is literally a raw list of changes, which we are still in the process of going through. Please forgive the raw data. Big thanks to our industry friends who help with this analysis.

Any entry below that is listed as A|B means a latency of A (in clocks) and a reciprocal throughput of B (in clocks per instruction).
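To make the notation concrete, here is a minimal Python sketch. The helper name `parse_lat_tp` and the `Timing` type are our own illustrative choices, not part of any tool used for the sweep, and the sketch assumes the second number is reciprocal throughput in clocks per instruction, so its inverse is the sustained instructions per clock.

```python
from typing import NamedTuple


class Timing(NamedTuple):
    latency: float           # clocks from inputs ready to result ready
    recip_throughput: float  # clocks per instruction when issued back to back

    @property
    def per_clock(self) -> float:
        """Sustained instructions per clock (inverse of reciprocal throughput)."""
        return 1.0 / self.recip_throughput


def parse_lat_tp(entry: str) -> Timing:
    """Parse an 'A|B' entry, e.g. '6|2' -> latency 6, reciprocal throughput 2."""
    lat, rtp = entry.split("|")
    return Timing(float(lat), float(rtp))


# SHA256MSG2 on the P-core, as listed in this article: 11|5 -> 6|2
old, new = parse_lat_tp("11|5"), parse_lat_tp("6|2")
print(f"latency: {old.latency:g} -> {new.latency:g} clocks")
print(f"per clock: {old.per_clock:.2f} -> {new.per_clock:.2f}")
```

Read this way, a smaller B is better: going from 5 to 2 clocks per instruction means the core can sustain one such instruction every 2 clocks instead of every 5.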

 

P-core: Golden Cove vs Cypress Cove

Microarchitecture Changes:

  • 6-wide decoder with a 32-byte window: code size matters much less, e.g. 3 MOV imm64 per clock (the last similar 50% jump was Pentium -> Pentium Pro in 1995; Conroe in 2006 was just a 3->4 jump)
  • Triple load: (almost) universal
    • every GPR, SSE, VEX, EVEX load gains (only MMX load unsupported)
    • BROADCAST*, GATHER*, PREFETCH* also gain
  • Decoupled double FADD units
    • every single and double SIMD VADD/VSUB (and AVX VADDSUB* and VHADD*/VHSUB*) has latency gains
    • When following another ADD/SUB, latency goes 4->2 clks
    • When following a MUL, latency goes 4->3 clks
    • AVX512 support: 512b ADD/SUB rec. throughput 0.5, as in server!
    • exception: half precision ADD/SUB handled by FMAs
    • exception: x87 FADD remained 3 clks
  • Some forms of GPR (general purpose register) immediate addition are treated as NOPs (removed at the "allocate/rename/move elimination/zeroing idioms" step)
    • LEA r64, [r64+imm8]
    • ADD r64, imm8
    • ADD r64, imm32
    • INC r64
    • Is this just for 64b addition GPRs?
  • eliminated instructions:
    • MOV r32/r64
    • (V)MOV(A/U)(PS/PD/DQ) xmm, ymm
    • 0-5 0x66 NOP
    • LNOP3-7
    • CLC/STC
  • zeroing idioms:
    • (V)XORPS/PD, (V)PXOR xmm, ymm
    • (V)PSUB(U)B/W/D/Q xmm
    • (V)PCMPGTB/W/D/Q xmm
    • (V)PXOR xmm
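The decode-window point in the list above comes down to simple arithmetic, sketched below in Python. This is a deliberately simplified model and an assumption on our part: it treats the decoders as seeing one window of bytes per clock and ignores alignment, the micro-op cache, and instruction mix. The function name `decode_rate` is hypothetical, and the 16-byte window used for the older 4-wide design is our assumption for comparison, not a figure from the list.

```python
def decode_rate(insn_bytes: int, window_bytes: int, decode_width: int) -> int:
    """Upper bound on same-length instructions decoded per clock.

    Simplified model: at most `decode_width` instructions per clock, and only
    as many as fit entirely inside one `window_bytes` window.
    """
    return min(decode_width, window_bytes // insn_bytes)


# MOV r64, imm64 encodes as REX.W prefix + opcode + 8-byte immediate = 10 bytes
MOV_IMM64_BYTES = 10

# 6-wide decode with a 32-byte window: three 10-byte instructions fit
print(decode_rate(MOV_IMM64_BYTES, window_bytes=32, decode_width=6))  # 3

# Assumed 16-byte window on a 4-wide design: only one fits
print(decode_rate(MOV_IMM64_BYTES, window_bytes=16, decode_width=4))  # 1
```

For long instructions the window, not the decoder count, is the limit, which is why a wider window makes code size much less important.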

Faster GPR instructions (vs Cypress Cove):

  • LOCK latency 20->18 clks
  • LEA with scale throughput 2->3/clk
  • (I)MUL r8 latency 4->3 clks
  • LAHF latency 3->1 clks
  • CMPS* latency 5->4 clks
  • REP CMPSB 1->3.7 Bytes/clock
  • REP SCASB 0.5->1.85 Bytes/clock
  • REP MOVS* 115->122 Bytes/clock
  • CMPXCHG16B 20|20 -> 16|14
  • PREFETCH* throughput 1->3/clk
  • ANDN/BLSI/BLSMSK/BLSR throughput 2->3/clock
  • SHA1RNDS4 latency 6->4
  • SHA1MSG2 throughput 0.2->0.25/clock
  • SHA256MSG2 11|5->6|2
  • ADC/SBB (r/e)ax 2|2 -> 1|1

Faster SIMD instructions (vs Cypress Cove):

  • *FADD xmm/ymm latency 4->3 clks (after MUL)
  • *FADD xmm/ymm latency 4->2 clks (after ADD)
  • * means (V)(ADD/SUB/ADDSUB/HADD/HSUB)(PS/PD) affected
  • V(ADD/SUB)(PS/PD) zmm 4|1 -> 3.3|0.5
  • CLMUL xmm  6|1->3|1
  • CLMUL ymm, zmm 8|2->3|1
  • VPGATHERDQ xmm, [xm32], xmm 22|1.67->20|1.5 clks
  • VPGATHERDD ymm, [ym32], ymm throughput 0.2 -> 0.33/clock
  • VPGATHERQQ ymm, [ym64], ymm throughput 0.33 -> 0.50/clock
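The FADD latency drop matters most for dependent chains, such as a serial sum where each add waits on the previous result. A rough Python sketch of that effect follows. The function `chain_clocks` is a hypothetical helper, and the model assumes a purely latency-bound chain, ignoring loads, loop overhead, and any use of multiple accumulators.

```python
def chain_clocks(n_ops: int, latency: int) -> int:
    """Clocks for n_ops back-to-back dependent operations (latency-bound)."""
    return n_ops * latency


N_ADDS = 1000  # illustrative chain length, not a measured workload

before = chain_clocks(N_ADDS, latency=4)  # previous FADD latency from the list
after = chain_clocks(N_ADDS, latency=2)   # ADD-after-ADD latency from the list

print(before, after, before / after)  # 4000 2000 2.0
```

Under these assumptions a serial reduction runs in about half the clocks, even though peak throughput for independent adds is a separate figure.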

Regressions, Slower instructions (vs Cypress Cove):

  • Store-to-Load-Forward 128b 5->7, 256b 6->7 clocks
  • PAUSE latency 140->160 clocks
  • LEA with scale latency 2->3 clocks
  • (I)DIV r8 latency 15->17 clocks
  • FXCH throughput 2->1/clock
  • LFENCE latency 6->12 clocks
  • VBLENDV(B/PS/PD) xmm, ymm 2->3 clocks
  • (V)AESKEYGEN latency 12->13 clocks
  • VCVTPS2PH/PH2PS latency 5->6 clocks
  • BZHI throughput 2->1/clock
  • VPGATHERDD ymm, [ym32], ymm latency 22->24 clocks
  • VPGATHERQQ ymm, [ym64], ymm latency 21->23 clocks

 

E-core: Gracemont vs Tremont

Microarchitecture Changes:

  • Dual 128b store port (works with every GPR, PUSH, MMX, SSE, AVX, non-temporal m32, m64, m128)
  • Zen2-like memory renaming with GPRs
  • New zeroing idioms
    • SUB r32, r32
    • SUB r64, r64
    • CDQ, CQO
    • (V)PSUBB/W/D/Q/SB/SW/USB/USW
    • (V)PCMPGTB/W/D/Q
  • New ones idiom: (V)PCMPEQB/W/D/Q
  • MOV elimination: MOV; MOVZX; MOVSX r32, r64
  • NOP elimination: NOP, 1-4 0x66 NOP throughput 3->5/clock, LNOP 3, LNOP 4, LNOP 5

Faster GPR instructions (vs Tremont):

  • PAUSE latency 158->62 clocks
  • MOVSX; SHL/R r, 1; SHL/R r, imm8 reciprocal throughput 1->0.25
  • ADD; SUB; CMP; AND; OR; XOR; NEG; NOT; TEST; MOVZX; BSWAP; LEA [r+r]; LEA [r+disp8/32] throughput 3->4 per clock
  • CMOV* throughput 1->2 per clock
  • RCR r, 1 10|10 -> 2|2
  • RCR/RCL r, imm/cl 13|13->11|11
  • SHLD/SHRD r1_32, r1_32, imm8 2|2 -> 2|0.5
  • MOVBE latency 1->0.5 clocks
  • (I)MUL r32 3|1 -> 3|0.5
  • (I)MUL r64 5|2 -> 5|0.5
  • REP STOSB/STOSW/STOSD/STOSQ 15/8/12/11 bytes/clock -> 15/15/15/15 bytes/clock

Faster SIMD instructions (vs Tremont):

  • A lot of xmm SIMD throughput is 4/clock instead of the theoretical maximum(?) of 3/clock; we are not sure how this is possible
  • MASKMOVQ throughput 1 per 104 clocks -> 1 per clock
  • PADDB/W/D; PSUBB/W/D; PAVGB/PAVGW 1|0.5 -> 1|0.33
  • PADDQ/PSUBQ/PCMPEQQ mm, xmm: 2|1 -> 1|0.33
  • PShift (x)mm, (x)mm 2|1 -> 1|0.33
  • PMUL*, PSADBW mm, xmm 4|1 -> 3|1
  • ADD/SUB/CMP/MAX/MINPS/PD 3|1 -> 3|0.5
  • MULPS/PD 4|1 -> 4|0.5
  • CVT*, ROUND xmm, xmm 4|1 -> 3|1
  • BLENDV* xmm, xmm 3|2 -> 3|0.88
  • AES, GF2P8AFFINEQB, GF2P8AFFINEINVQB xmm 4|1 -> 3|1
  • SHA256RNDS2 5|2 -> 4|1
  • PHADD/PHSUB* 6|6 -> 5|5

Regressions, Slower (vs Tremont):

  • m8, m16 load latency 4->5 clocks
  • ADD/MOVBE load latency 4->5 clocks
  • LOCK ADD 16|16->18|18
  • XCHG mem 17|17->18|18
  • (I)DIV +1 clock
  • DPPS 10|1.5 -> 18|6
  • DPPD 6|1 -> 10|3.5
  • FSIN/FCOS 12% slower

 


