Instruction Changes

Both of the processor cores inside Alder Lake are brand new – they build on the previous generation Core and Atom designs in multiple ways. As always, Intel gives us a high level overview of the microarchitecture changes, as we’ve written in an article from Architecture Day:

At the highest level, the P-core supports a 6-wide decode (up from 4), and has split the execution ports to allow for more operations to execute at once, enabling higher IPC and ILP from workloads that can take advantage. Usually a wider decode consumes a lot more power, but Intel says that its micro-op cache (now 4K) and front-end are improved enough that the decode engine spends 80% of its time power gated.

The E-core similarly has a 6-wide decode, although split into 2x3-wide. It has 17 execution ports, buffered by double the load/store support of the previous generation Atom core. Beyond this, Gracemont is the first Atom core to support AVX2 instructions.

As part of our analysis into new microarchitectures, we also do an instruction sweep to see what other benefits have been added. The following is literally a raw list of changes, which we are still in the process of going through. Please forgive the raw data. Big thanks to our industry friends who help with this analysis.

Any entry below written as A|B means A is the latency (in clocks) and B is the reciprocal throughput (clocks per instruction, so 0.5 means two instructions per clock).
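To make the A|B notation concrete, here is a minimal Python sketch that parses entries in the form used in the lists below. The names `Timing`, `parse_timing` and `parse_change` are hypothetical helpers written for illustration, not part of any tool used for the sweep.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    latency: float           # clocks until the result is available
    recip_throughput: float  # clocks per instruction when issued back to back

    @property
    def per_clock(self) -> float:
        """Instructions per clock, the inverse of reciprocal throughput."""
        return 1.0 / self.recip_throughput


def parse_timing(text: str) -> Timing:
    """Parse a single 'A|B' entry such as '6|2' or '1|.33'."""
    latency, recip = text.split("|")
    return Timing(float(latency), float(recip))


def parse_change(text: str):
    """Parse an 'old -> new' pair such as '11|5->6|2'."""
    old, new = text.split("->")
    return parse_timing(old.strip()), parse_timing(new.strip())


# SHA256MSG2 entry from the P-core list: 11|5 -> 6|2
old, new = parse_change("11|5->6|2")
print(f"latency saved: {old.latency - new.latency} clocks")
print(f"new throughput: {new.per_clock} per clock")
```

Entries that quote only a latency or only a throughput (for example "2->3/clk") do not use this form and would need separate handling.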

 

P-core: Golden Cove vs Cypress Cove

Microarchitecture Changes:

  • 6-wide decoder with a 32-byte window: code size matters much less, e.g. 3 MOV imm64 per clock (the last similar 50% jump was Pentium -> Pentium Pro in 1995; Conroe in 2006 was just a 3->4 jump)
  • Triple load: (almost) universal
    • every GPR, SSE, VEX, EVEX load gains (only MMX load unsupported)
    • BROADCAST*, GATHER*, PREFETCH* also gains
  • Decoupled double FADD units
    • every single and double SIMD VADD/VSUB (and AVX VADDSUB* and VHADD*/VHSUB*) has latency gains
    • Chained after another ADD/SUB: latency 4->2 clks
    • Chained after a MUL: latency 4->3 clks
    • AVX512 support: 512b ADD/SUB rec. throughput 0.5, as in server!
    • exception: half precision ADD/SUB handled by FMAs
    • exception: x87 FADD remained 3 clks
  • Some forms of GPR (general purpose register) immediate additions are treated as NOPs (removed at the "allocate/rename/move elimination/zeroing idioms" step)
    • LEA r64, [r64+imm8]
    • ADD r64, imm8
    • ADD r64, imm32
    • INC r64
    • Is this just for 64b addition GPRs?
  • eliminated instructions:
    • MOV r32/r64
    • (V)MOV(A/U)(PS/PD/DQ) xmm, ymm
    • 0-5 0x66 NOP
    • LNOP3-7
    • CLC/STC
  • zeroing idioms:
    • (V)XORPS/PD, (V)PXOR xmm, ymm
    • (V)PSUB(U)B/W/D/Q xmm
    • (V)PCMPGTB/W/D/Q xmm
    • (V)PXOR xmm

Faster GPR instructions (vs Cypress Cove):

  • LOCK latency 20->18 clks
  • LEA with scale throughput 2->3/clk
  • (I)MUL r8 latency 4->3 clks
  • LAHF latency 3->1 clks
  • CMPS* latency 5->4 clks
  • REP CMPSB 1->3.7 Bytes/clock
  • REP SCASB 0.5->1.85 Bytes/clock
  • REP MOVS* 115->122 Bytes/clock
  • CMPXCHG16B 20|20 -> 16|14
  • PREFETCH* throughput 1->3/clk
  • ANDN/BLSI/BLSMSK/BLSR throughput 2->3/clock
  • SHA1RNDS4 latency 6->4
  • SHA1MSG2 throughput 0.2->0.25/clock
  • SHA256MSG2 11|5->6|2
  • ADC/SBB (r/e)ax 2|2 -> 1|1
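As a quick way to read the bytes/clock entries above, the following sketch turns the old and new rates into speedup factors. The figures are copied from the list; the dictionary and function names are hypothetical.

```python
# (old, new) bytes per clock, Cypress Cove -> Golden Cove, from the list above
BYTES_PER_CLOCK = {
    "REP CMPSB": (1.0, 3.7),
    "REP SCASB": (0.5, 1.85),
    "REP MOVS*": (115.0, 122.0),
}


def speedup(old: float, new: float) -> float:
    """Ratio of new rate to old rate; above 1.0 means faster."""
    return new / old


for name, (old, new) in BYTES_PER_CLOCK.items():
    print(f"{name}: {speedup(old, new):.2f}x")
```

The two string-compare style operations improve by the same factor, while REP MOVS* gains only a few percent.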

Faster SIMD instructions (vs Cypress Cove):

  • *FADD xmm/ymm latency 4->3 clks (after MUL)
  • *FADD xmm/ymm latency 4->2 clks (after ADD)
  • * means (V)(ADD/SUB/ADDSUB/HADD/HSUB)(PS/PD) affected
  • VADD/SUB(PS/PD) zmm 4|1 -> 3.3|0.5
  • CLMUL xmm  6|1->3|1
  • CLMUL ymm, zmm 8|2->3|1
  • VPGATHERDQ xmm, [xm32], xmm 22|1.67->20|1.5 clks
  • VPGATHERDD ymm, [ym32], ymm throughput 0.2 -> 0.33/clock
  • VPGATHERQQ ymm, [ym64], ymm throughput 0.33 -> 0.50/clock
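To show how a latency and reciprocal-throughput pair translates into run time, here is a rough sketch using the zmm VADD/SUB entry above (4|1 -> 3.3|0.5). It assumes a deliberately simple model: the cost of a block of additions is the larger of the latency limit (operations per dependency chain times latency) and the throughput limit (operations times reciprocal throughput). Real code is affected by many other factors, and `estimate_clocks` is a hypothetical helper.

```python
def estimate_clocks(n_ops, latency, recip_throughput, chains=1):
    """Rough lower bound in clocks for n_ops additions split over independent chains."""
    latency_bound = (n_ops / chains) * latency    # each chain is strictly serial
    throughput_bound = n_ops * recip_throughput   # issue limit of the execution units
    return max(latency_bound, throughput_bound)


OLD = (4.0, 1.0)  # Cypress Cove zmm VADD/SUB: 4|1
NEW = (3.3, 0.5)  # Golden Cove zmm VADD/SUB: 3.3|0.5

for chains in (1, 8):
    old = estimate_clocks(1024, *OLD, chains=chains)
    new = estimate_clocks(1024, *NEW, chains=chains)
    print(f"{chains} chain(s): {old:.0f} -> {new:.0f} clocks")
```

Under this model a single dependency chain only benefits from the lower latency, while eight independent chains are throughput bound on both cores, so the estimate halves from 1024 to 512 clocks.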

Regressions, Slower instructions (vs Cypress Cove):

  • Store-to-Load-Forward 128b 5->7, 256b 6->7 clocks
  • PAUSE latency 140->160 clocks
  • LEA with scale latency 2->3 clocks
  • (I)DIV r8 latency 15->17 clocks
  • FXCH throughput 2->1/clock
  • LFENCE latency 6->12 clocks
  • VBLENDV(B/PS/PD) xmm, ymm 2->3 clocks
  • (V)AESKEYGEN latency 12->13 clocks
  • VCVTPS2PH/PH2PS latency 5->6 clocks
  • BZHI throughput 2->1/clock
  • VPGATHERDD ymm, [ym32], ymm latency 22->24 clocks
  • VPGATHERQQ ymm, [ym64], ymm latency 21->23 clocks

 

E-core: Gracemont vs Tremont

Microarchitecture Changes:

  • Dual 128b store port (works with every GPR, PUSH, MMX, SSE, AVX, non-temporal m32, m64, m128)
  • Zen2-like memory renaming with GPRs
  • New zeroing idioms
    • SUB r32, r32
    • SUB r64, r64
    • CDQ, CQO
    • (V)PSUBB/W/D/Q/SB/SW/USB/USW
    • (V)PCMPGTB/W/D/Q
  • New ones idiom: (V)PCMPEQB/W/D/Q
  • MOV elimination: MOV; MOVZX; MOVSX r32, r64
  • NOP elimination: NOP, 1-4 0x66 NOP throughput 3->5/clock, LNOP 3, LNOP 4, LNOP 5

Faster GPR instructions (vs Tremont)

  • PAUSE latency 158->62 clocks
  • MOVSX; SHL/R r, 1; SHL/R r,imm8  tp 1->0.25
  • ADD; SUB; CMP; AND; OR; XOR; NEG; NOT; TEST; MOVZX; BSWAP; LEA [r+r]; LEA [r+disp8/32] throughput 3->4 per clock
  • CMOV* throughput 1->2 per clock
  • RCR r, 1 10|10 -> 2|2
  • RCR/RCL r, imm/cl 13|13->11|11
  • SHLD/SHRD r1_32, r1_32, imm8 2|2 -> 2|0.5
  • MOVBE latency 1->0.5 clocks
  • (I)MUL r32 3|1 -> 3|0.5
  • (I)MUL r64 5|2 -> 5|0.5
  • REP STOSB/STOSW/STOSD/STOSQ 15/8/12/11 bytes/clock -> 15/15/15/15 bytes/clock
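The PAUSE entry at the top of this list matters for spin-wait loops that count PAUSE iterations rather than elapsed time. The sketch below uses the PAUSE latencies from the P-core and E-core lists; the 3.0 GHz clock and the count of 100 iterations are illustrative assumptions, and `spin_wait_ns` is a hypothetical helper.

```python
# PAUSE latency in clocks, taken from the P-core and E-core lists
PAUSE_LATENCY_CLOCKS = {
    "Cypress Cove": 140,
    "Golden Cove": 160,
    "Tremont": 158,
    "Gracemont": 62,
}


def spin_wait_ns(core: str, pause_count: int, clock_ghz: float) -> float:
    """Approximate duration of a loop of back-to-back PAUSE instructions."""
    return PAUSE_LATENCY_CLOCKS[core] * pause_count / clock_ghz


# Assumption: both cores run at 3.0 GHz, purely for comparison.
for core in ("Golden Cove", "Gracemont"):
    print(f"{core}: {spin_wait_ns(core, 100, 3.0):.0f} ns for 100 PAUSEs")
```

At equal clocks, the same iteration count waits roughly 2.6 times longer on a P-core than on an E-core, so a fixed PAUSE count is not a stable unit of delay on a hybrid part.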

Faster SIMD instructions (vs Tremont)

  • A lot of xmm SIMD throughput is 4/clock instead of theoretical maximum(?) of 3/clock, not sure how this is possible
  • MASKMOVQ throughput 1 per 104 clocks -> 1 per clock
  • PADDB/W/D; PSUBB/W/D PAVGB/PAVGW 1|0.5 -> 1|.33
  • PADDQ/PSUBQ/PCMPEQQ mm, xmm: 2|1 -> 1|.33
  • PShift (x)mm, (x)mm 2|1 -> 1|.33
  • PMUL*, PSADBW mm, xmm 4|1 -> 3|1
  • ADD/SUB/CMP/MAX/MINPS/PD 3|1 -> 3|0.5
  • MULPS/PD 4|1 -> 4|0.5
  • CVT*, ROUND xmm, xmm 4|1 -> 3|1
  • BLENDV* xmm, xmm 3|2 -> 3|0.88
  • AES, GF2P8AFFINEQB, GF2P8AFFINEINVQB xmm 4|1 -> 3|1
  • SHA256RNDS2 5|2 -> 4|1
  • PHADD/PHSUB* 6|6 -> 5|5

Regressions, Slower (vs Tremont):

  • m8, m16 load latency 4->5 clocks
  • ADD/MOVBE load latency 4->5 clocks
  • LOCK ADD 16|16->18|18
  • XCHG mem 17|17->18|18
  • (I)DIV +1 clock
  • DPPS 10|1.5 -> 18|6
  • DPPD 6|1 -> 10|3.5
  • FSIN/FCOS 12% slower

 
