Security Updates, Improved Instruction Performance and AVX-512 Updates

With every new microarchitecture update, there are goals on several fronts: add new instructions, decrease the latency of current instructions, increase the throughput of current instructions, and remove bugs. The big headline addition for Sunny Cove and Ice Lake is AVX-512, which until now hadn't appeared on a widely distributed mainstream consumer processor – technically we saw it in Cannon Lake, but that was a limited-run CPU. Nonetheless, a lot of what went into Cannon Lake also shows up in the Sunny Cove design. To complicate matters, AVX-512 comes in plenty of different flavors. On top of that, Intel has also made significant improvements to a number of other instructions throughout the design.

Big thanks to InstLatX64 for his help in analyzing the benchmark results.

Security

On security, almost all the documented hardware security fixes are in place with Sunny Cove. Through the CPUID results, we can determine that SSBD is enabled, as is IA32_ARCH_CAPABILITIES, L1D_FLUSH, STIBP, IBPB/IBRS and MD_CLEAR.

This aligns with Intel’s list of Sunny Cove security improvements:

Sunny Cove Security
Abbreviation | Description | Common Name | Solution
BCB | Bounds Check Bypass | Spectre V1 | Software
BTI | Branch Target Injection | Spectre V2 | Hardware + OS
RDCL | Rogue Data Cache Load | V3 | Hardware
RSSR | Rogue System Register Read | V3a | Hardware
SSB | Speculative Store Bypass | V4 | Hardware + OS
L1TF | Level 1 Terminal Fault | Foreshadow | Hardware
MFBDS | uArch Fill Buffer Data Sampling | RIDL | Hardware
MSBDS | uArch Store Buffer Data Sampling | Fallout | Hardware
MLPDS | uArch Load Port Data Sampling | - | Hardware
MDSUM | uArch Data Sampling Uncacheable Memory | - | Hardware

Aside from Spectre V1, which has no suitable hardware solution, almost all of the rest have been solved through hardware or firmware (Intel won't distinguish which, but to a certain extent it doesn't matter for new hardware). This is a step in the right direction, though it may have knock-on effects: any performance recovered by moving a mitigation from firmware into hardware will simply be rolled into any advertised IPC increase.
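On a Linux system, a quick way to see which of these features a CPU advertises is the flags line in /proc/cpuinfo. The sketch below parses such a line; the flag names follow the Linux kernel's naming convention (exact names can vary by kernel version and vendor), and the sample string is illustrative rather than taken from a real Ice Lake part:

```python
# Minimal sketch: check which speculation-mitigation features a CPU reports,
# given a /proc/cpuinfo-style "flags" line. Flag names follow the Linux
# convention; the sample line below is illustrative, not real hardware output.

MITIGATION_FLAGS = {
    "ssbd":      "Speculative Store Bypass Disable",
    "stibp":     "Single Thread Indirect Branch Predictors",
    "ibpb":      "Indirect Branch Prediction Barrier",
    "ibrs":      "Indirect Branch Restricted Speculation",
    "md_clear":  "Microarchitectural Data Sampling buffer clearing",
    "flush_l1d": "L1 Data Cache Flush",
}

def reported_mitigations(flags_line: str) -> list[str]:
    """Return the known mitigation flags present in a cpuinfo flags line."""
    present = set(flags_line.split())
    return [name for name in MITIGATION_FLAGS if name in present]

# Illustrative, abbreviated flags line:
sample = "fpu avx512f ssbd stibp ibpb md_clear flush_l1d"
print(reported_mitigations(sample))
```

On a real Linux box, the same function could be fed the actual line via `open("/proc/cpuinfo")`.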

Also on the security side is SGX, Intel's Software Guard Extensions. Sunny Cove now becomes Intel's first public processor to enable both AVX-512 and SGX in the same design. Technically the first chip with both should have been Skylake-X, however that feature was ultimately disabled after failing some validation cases. It now comes together for Sunny Cove in Ice Lake-U, which is also a consumer processor.

Instruction Improvements and AVX-512

As mentioned, Sunny Cove pulls a number of key improvements from the Cannon Lake design, despite the Cannon Lake chip having the same cache configuration as Skylake. One of the key points here is 64-bit integer division, which goes from a 97-cycle latency down to an 18-cycle latency, blowing past AMD's 45-cycle latency. As an ex-researcher who worked on high-precision math code, this speedup would have been critical.

  • IDIV: 97-cycle latency to 18-cycle latency
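To see why the latency figure matters so much, consider a chain of dependent divisions where each result feeds the next, as is common in extended-precision arithmetic: total time is dominated by latency, not throughput. A back-of-the-envelope sketch using the cycle counts above:

```python
# Back-of-the-envelope: cycles to retire a chain of N dependent divisions.
# With each result feeding the next, latency cannot be hidden by the
# out-of-order machinery, so total time is roughly N * latency.

def chain_cycles(n_divs: int, latency: int) -> int:
    """Cycles for n dependent divisions at the given per-instruction latency."""
    return n_divs * latency

n = 1000
print(chain_cycles(n, 97))  # Skylake-era 64-bit IDIV
print(chain_cycles(n, 18))  # Sunny Cove
print(chain_cycles(n, 45))  # AMD figure quoted above
```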

For the general purpose registers, we see a lot of changes, and most of them quite sizable.

Sunny Cove GPR Changes
Instruction | Description | Skylake | Sunny Cove
Complex LEA | Complex Load Effective Address | 3-cycle latency, 1 per cycle | 1-cycle latency, 2 per cycle
SHL/SHR | Shift Left/Right | 2-cycle latency, 0.5 per cycle | 1-cycle latency, 1 per cycle
ROL/ROR | Rotate Left/Right | 2-cycle latency, 0.5 per cycle | 1-cycle latency, 1 per cycle
SHLD/SHRD | Double Precision Shift Left/Right | 4-cycle latency, 0.5 per cycle | 4-cycle latency, 1 per cycle
4*MOVS | Four repeated string MOVs | Limited instructions | 104 bits/clock, all MOVS* instructions
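As a refresher on what SHLD/SHRD actually do – they are the workhorse for multi-word shifts in big-number code – here is a minimal Python model of the 64-bit double-precision left shift (the function name is our own, for illustration):

```python
# Model of SHLD r64, r64, imm8: shift dst left by count, filling the
# vacated low bits from the high bits of src (a "double precision" shift
# across a 128-bit window formed by dst:src).

MASK64 = (1 << 64) - 1

def shld64(dst: int, src: int, count: int) -> int:
    count &= 63          # hardware masks the shift count to 6 bits
    if count == 0:
        return dst & MASK64
    return ((dst << count) | (src >> (64 - count))) & MASK64

# Shifting left by 16 pulls in the top 16 bits of src:
print(hex(shld64(0x00000000DEADBEEF, 0xABCD << 48, 16)))
```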

In the past we've seen x87 instructions regress, made slower as they become obsolete. For whatever reason, Sunny Cove goes the other way, decreasing the x87 FMUL latency from 5 cycles to 4 cycles.

The SIMD units also go through some changes:

Sunny Cove SIMD
Instruction | Description | Skylake | Sunny Cove
SIMD Packing | SIMD packing (now slower) | 1-cycle latency, 1 per cycle | 3-cycle latency, 1 per cycle
AES* | AES Crypto Instructions (128-bit / 256-bit) | 4-cycle latency, 2 per cycle | 3-cycle latency, 2 per cycle
CLMUL | Carry-Less Multiplication | 7-cycle latency, 1 per cycle | 6-cycle latency, 1 per cycle
PHADD/PHSUB | Packed Horizontal Add/Subtract and Saturate | 3-cycle latency, 0.5 per cycle | 2-cycle latency, 1 per cycle
VPMOV* xmm | Vector Packed Move | 2-cycle latency, 0.5 per cycle | 2-cycle latency, 1 per cycle
VPMOV* ymm | Vector Packed Move | 4-cycle latency, 0.5 per cycle | 2-cycle latency, 1 per cycle
VPMOVZX/SX* xmm | Vector Packed Move with Zero/Sign Extend | 1-cycle latency, 1 per cycle | 1-cycle latency, 2 per cycle
POPCNT | Population Count | Microcode | 50% faster than software (under L1-D size)
REP STOS* | Repeated Store String | 62 bits/cycle | 54 bits/cycle
VPCONFLICT | Conflict Detection | Microcode | Still microcode only
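For reference, carry-less multiplication (CLMUL) is ordinary long multiplication in which the partial products are combined with XOR instead of addition – it is the core primitive behind GCM authentication and some CRC computations. A minimal Python model:

```python
# Model of carry-less multiplication (PCLMULQDQ-style): multiply two
# polynomials over GF(2). Shift-and-add long multiplication, except the
# partial products are XORed together rather than added with carries.

def clmul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a   # XOR instead of +, so no carries propagate
        a <<= 1
        b >>= 1
    return result

# Example: (x^3 + x + 1) * (x + 1) = x^4 + x^3 + x^2 + 1 over GF(2)
print(bin(clmul(0b1011, 0b11)))
```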

We’ve already gone through all of the new AVX-512 instructions in our Sunny Cove microarchitecture disclosure. These include the following families:

  • AVX-512_VNNI (Vector Neural Network Instructions)
  • AVX-512_VBMI (Vector Byte Manipulation Instructions)
  • AVX-512_VBMI2 (second level VBMI)
  • AVX-512_BITALG (bit algorithms)
  • AVX-512_IFMA (Integer Fused Multiply Add)
  • AVX-512_VAES (Vector AES)
  • AVX-512_VPCLMULQDQ (Carry-Less Multiplication of Long Quad Words)
  • AVX-512+GFNI (Galois Field New Instructions)
  • SHA (not AVX-512, but still new)
  • GNA (Gaussian Neural Accelerator)

(Intel also has the GMM (Gaussian Mixture Model) inside the core since Skylake, but I’ve yet to see any information on this outside a single line in the coding manual.)

For all these new AVX-512 instructions, it's worth noting that they can be run in 128-bit, 256-bit, or 512-bit mode, depending on the data types passed to them. Each mode can have different latencies and throughputs, which often get worse in 512-bit mode; but overall, assuming you can fill a register with a 512-bit data type, the raw processing will be faster, even with the frequency differential. It should be noted that this doesn't take into account any additional overhead for entering the 512-bit power state.
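As a rough illustration of that trade-off, consider how many elements get processed per nanosecond at one vector operation per cycle. The clock speeds below are hypothetical, chosen only to show the shape of the comparison, not measured Ice Lake frequencies:

```python
# Illustrative only: whether 512-bit vectors win depends on how far the
# clock drops in the 512-bit power state. Frequencies here are hypothetical.

def elements_per_ns(vector_bits: int, ghz: float, elem_bits: int = 32) -> float:
    """32-bit elements processed per nanosecond at one vector op per cycle."""
    return (vector_bits / elem_bits) * ghz

print(elements_per_ns(256, 3.0))   # 256-bit mode at a higher clock
print(elements_per_ns(512, 2.6))   # 512-bit mode at a reduced clock
```

Even with a sizeable frequency drop, the doubled width wins on raw throughput, which is the point the paragraph above makes.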

Most of these new instructions are relatively fast, many with only a few cycles of latency. We observed the following:

Sunny Cove Vector Instructions (latency / throughput)
Instruction | Description | XMM | YMM | ZMM
VNNI | Vector Neural Network Instructions | 5-cycle / 2 per cycle | 5-cycle / 2 per cycle | 5-cycle / 1 per cycle
VPOPCNT* | Return the number of bits set to 1 | 3-cycle / 1 per cycle | 3-cycle / 1 per cycle | 3-cycle / 1 per cycle
VPCOMPRESS* | Store Packed Data | 3-cycle / 0.5 per cycle | 3-cycle / 0.5 per cycle | 3-cycle / 0.5 per cycle
VPEXPAND* | Load Packed Data | 5-cycle / 0.5 per cycle | 5-cycle / 0.5 per cycle | 5-cycle / 0.5 per cycle
VPSHLD* | Vector Shift | 1-cycle / 2 per cycle | 1-cycle / 2 per cycle | 1-cycle / 1 per cycle
VAES* | Vector AES Instructions | 3-cycle / 2 per cycle | 3-cycle / 2 per cycle | 3-cycle / 1 per cycle
VPCLMUL | Vector Carry-Less Multiply | 6-cycle / 1 per cycle | 8-cycle / 0.5 per cycle | 8-cycle / 0.5 per cycle
GFNI | Galois Field New Instructions | 3-cycle / 2 per cycle | 3-cycle / 2 per cycle | 3-cycle / 1 per cycle
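Of these, VNNI deserves a closer look: its flagship instruction VPDPBUSD fuses what previously took a three-instruction chain (VPMADDUBSW, VPMADDWD, VPADDD) into one operation. A minimal Python model of a single 32-bit lane, ignoring the saturation behavior for simplicity:

```python
# Model of one 32-bit lane of VPDPBUSD (AVX-512 VNNI): four unsigned-8-bit
# by signed-8-bit products, summed into a 32-bit accumulator.
# (Signed saturation on overflow is ignored in this sketch.)

def vpdpbusd_lane(acc: int, u8x4: list[int], s8x4: list[int]) -> int:
    return acc + sum(u * s for u, s in zip(u8x4, s8x4))

# One fused instruction replaces the old multiply/widen/accumulate chain:
print(vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8]))
```

This is the int8 dot-product pattern at the heart of quantized neural-network inference, which is what the instruction family is named for.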

For all of the common AVX2 instructions, xmm/ymm latencies and throughputs are identical to Skylake; however, the zmm versions are often a few cycles slower for DIV/SQRT variants.

Other Noticeable Observations

From our testing, we were also able to probe some of the other parts of the core, such as the added store port and shuffle unit.

Our data shows that the second store port is not identical to the first, which explains the imbalance when it comes to writes: rather than matching the two 64-bit loads per cycle, the extra port only supports either one 64-bit write, one 32-bit write, or two 16-bit writes. This means we mainly see speed-ups with GPR/XMM data, and the result is only a small improvement for 512-bit SCATTER instructions. Otherwise, it seems not to work with any 256-bit or 512-bit operand (you can however use it with 64-bit AVX-512 mask registers). This is going to cause a slight headache for anyone currently limited by SCATTER stores.
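For readers unfamiliar with scatter semantics, a masked scatter performs one store per active lane, which is exactly why per-cycle store-port width matters here. A minimal Python model (function and parameter names are our own, for illustration):

```python
# Model of an AVX-512-style masked scatter: for each lane whose mask bit is
# set, write values[i] to mem[indices[i]]. Real hardware decomposes this into
# individual stores, one per active lane, competing for the store ports.

def masked_scatter(mem: list[int], indices: list[int],
                   values: list[int], mask: int) -> None:
    for i, (idx, val) in enumerate(zip(indices, values)):
        if (mask >> i) & 1:        # lane i is active
            mem[idx] = val

mem = [0] * 8
masked_scatter(mem, [1, 3, 5, 7], [11, 22, 33, 44], mask=0b1011)
print(mem)
```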

The new shuffle unit is only 256-bit wide. It will handle a number of integer instructions (UNPCK, PSLLDQ, SHUF*, MOVSHDUP, but not PALIGNR or PACK), but only a couple of floating point instructions (SHUFPD, SHUFPS).
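As an example of the kind of operation this unit handles, here is a Python model of SHUFPS on a single 128-bit lane, where the imm8 control byte selects which source elements land in each result position:

```python
# Model of SHUFPS on one 128-bit lane: the low two result elements are
# selected from a, the high two from b, using four 2-bit fields of imm8.

def shufps(a: list[float], b: list[float], imm8: int) -> list[float]:
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]   # four 2-bit selectors
    return [a[sel[0]], a[sel[1]], b[sel[2]], b[sel[3]]]

# imm8 = 0b01_00_11_10 picks a[2], a[3], b[0], b[1]:
print(shufps([0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], 0b01001110))
```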

Comments

  • Phynaz - Friday, August 2, 2019 - link

    What? TDP doesn’t mean what you think it does.
  • Alexvrb - Monday, August 5, 2019 - link

    I didn't feel like quoting the entire paragraph. But please DO elaborate. Then tell me how useful TDP is when they let OEMs set PL2 and Tau to... anything, really. You can take two "95W" processors and their power and thermals under load are radically different across a range of mainboards. This is reflected in mobile as well, where they let OEMs do pretty much whatever - the results aren't constrained by the processor no matter what the claimed TDP is. That doesn't even COUNT overclocking.

    Meanwhile AMD chips don't hand over control to mainboards unless you ARE overclocking, which is how it SHOULD be.
  • Alistair - Friday, August 2, 2019 - link

    I didn't see any discussion or comparison vs. the i7-9850H. Let's see a 28W TDP version of the 6 core i7-9850H put against these new chips. Same money, 50 percent more cores. Anyone in their right mind should be looking for an i7-9850H or 9750H laptop instead over these 10nm products. Where is the 6 core 10nm CPU? Don't buy a 4 cores laptop if you're looking for good performance in 2019-2020 imo.

    If you want a 4 core laptop get a cheaper 14nm based laptop. If you want performance get a 6 core. I really really don't see the point in these products.
  • Alexvrb - Friday, August 2, 2019 - link

    They gotta do *something* with all those 10nm wafers. Ian can't eat them all, and China said they don't want any more half-baked 10nm products after the last go-around. Maybe in 2020 we'll see 10nm++ and it will be as good as phase one 10nm was supposed to be.

    But yeah, their current 10nm products are a bit disappointing outside of the fatter GPUs and better memory speeds. If you're using something with a dGPU there's little point vs their own 14++, it only starts to make sense if you want AMD-like iGPU performance with the latest Core processor design. Even then that's only limited to models with a high EU count (48+) as the 32 EU models just look meh.

    They're going to have some stiff competition when 7nm Zen 2 APUs launch. I guess that's why they're attacking the low-power first, as AMD is still stuck on 12nm rehash Zen+ products for now.
  • InvidiousIgnoramus - Friday, August 2, 2019 - link

    I still find it amusing that the architecture with "Ice" in it's name has low clock speeds presumably from power/heat issues.
  • abufrejoval - Friday, August 2, 2019 - link

    Great work! And kudos to AMD to make Intel work so much harder to get good news out!

    Two die carrier layouts but the chips looking identical:

    First of all, I assume that the bigger and square chip is essentially the North-Bridge in 14nm?

    And the smaller rectangular one the CPU+iGPU?

    And I guess at 64EU we are talking about more than 60% of die area going to iGPU while even at quad core and AES-512 the CPU + cache will be perhaps 30%?

    Is there any HSA or GPGPU compute to 'pay' for that iGPU surface and power in professional workloads?

    Or is it really just for gaming?

    Am I also correct to assume that of the extra thermal budget in the 28Watt parts, none really goes to the CPU, only allows it to stay within the 15 Watt envelope while the iGPU is also running?

    Are we talking different die layouts and sizes for dual/quad CPUs and 64/32 iGPU EUs or is it really all just binning, meaning that an Core i3-1000G1 is a chip where 70% surface area of an Core i7-1060G7 failed to make it?

    Why am I thinking they are heading down a path without consumer value returns?

    I got a Lenovo S730 i7-8565U or Whisky Lake recently for a little over €1000 and I got a couple of J5005 Atoms recently for a little over €100 (admittedly complete notebook vs. RAM less Mini-ITX mainboard). The difference in power is 15 vs 10 Watts.

    Both are fairly competent 2D machines even at 4k. Both are terrible gaming machines, but I don't really think that ultrabook portable gaming performance is a selling point.

    If I were free to choose CPU vs. GPU real-estate, I'd definitely go left, say 6 or 8 CPU cores or just higher sustained turbos and make do with the J5005's 18 iGPU EUs, because CPU power is what I profit from professionally.

    For GPU, every € I spend gets me vastly more gaming experience in less mobile form factors, which is fine: I don't see how I could run in a game and outside without breaking my newest toy.
  • Sahrin - Friday, August 2, 2019 - link

    $426 for a quad core in 2019. What a time to be alive.
  • eva02langley - Friday, August 2, 2019 - link

    So basically... expensive, low yield, 4 cores, low frequency.

    Outside of better IGPU, barely matching AMD offering, and AVX512, which is not even a matter for a 4 cores CPU, 10nm is an abysmal failure.
  • Phynaz - Friday, August 2, 2019 - link

    So basically....you’re an imbecile
  • Korguz - Friday, August 2, 2019 - link

    your one to talk phynaz, i guess you want to be stuck on quad cores in notebooks for ever ???
