The Ice Lake Benchmark Preview: Inside Intel's 10nm

by Dr. Ian Cutress on August 1, 2019 9:00 AM EST

Security Updates, Improved Instruction Performance and AVX-512 Updates

With every new microarchitecture update, there are goals on several fronts: add new instructions, decrease the latency of current instructions, increase the throughput of current instructions, and remove bugs. The big headline addition for Sunny Cove and Ice Lake is AVX-512, which hasn’t yet appeared on a mainstream widely distributed consumer processor – technically we saw it in Cannon Lake, but that was a limited run CPU. Nonetheless, a lot of what went into Cannon Lake also shows up in the Sunny Cove design. To complicate matters, AVX-512 comes in plenty of different flavors. But on top of that, Intel also made a significant number of improvements to a number of instructions throughout the design.

Big thanks to InstLatX64 for his help in analyzing the benchmark results.

Security

On security, almost all the documented hardware security fixes are in place with Sunny Cove. Through the CPUID results, we can determine that SSBD is enabled, as is IA32_ARCH_CAPABILITIES, L1D_FLUSH, STIBP, IBPB/IBRS and MD_CLEAR.
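
As a rough illustration of how such a CPUID check works (not the tool used for this article): the features named above are enumerated in CPUID leaf 7, sub-leaf 0, register EDX, at the bit positions Intel documents. The sketch below decodes a raw EDX value. The names `decode_leaf7_edx` and `example_edx` are hypothetical, and the example value is constructed by hand rather than read from an Ice Lake system.

```python
# Bit positions in CPUID.(EAX=07H, ECX=0):EDX for the features named in the text.
LEAF7_EDX_BITS = {
    "MD_CLEAR": 10,
    "IBRS/IBPB": 26,
    "STIBP": 27,
    "L1D_FLUSH": 28,
    "IA32_ARCH_CAPABILITIES": 29,
    "SSBD": 31,
}


def decode_leaf7_edx(edx):
    """Map each feature name to True if its bit is set in the raw EDX value."""
    return {name: bool((edx >> bit) & 1) for name, bit in LEAF7_EDX_BITS.items()}


# Hypothetical EDX value with all six bits set; not a dump from real hardware.
example_edx = sum(1 << bit for bit in LEAF7_EDX_BITS.values())

for name, present in decode_leaf7_edx(example_edx).items():
    print(f"{name:24s} {'yes' if present else 'no'}")
```

On a real system the EDX value would come from executing the CPUID instruction, or from a tool that dumps it. The decoding step is the same either way.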

This aligns with Intel’s list of Sunny Cove security improvements:

Sunny Cove Security
Abbreviation	Description	Name	Solution
BCB	Bound Check Bypass	Spectre V1	Software
BTI	Branch Target Injection	Spectre V2	Hardware+OS
RDCL	Rogue Data Cache Load	V3	Hardware
RSSR	Rogue System Register Read	V3a	Hardware
SSB	Speculative Store Bypass	V4	Hardware+OS
L1TF	Level 1 Terminal Fault	Foreshadow	Hardware
MFBDS	uArch Fill Buffer Data Sampling	RIDL	Hardware
MSBDS	uArch Store Buffer Data Sampling	Fallout	Hardware
MLPDS	uArch Load Port Data Sampling	-	Hardware
MDSUM	uArch Data Sampling Uncachable Memory	-	Hardware

Aside from Spectre V1, which has no suitable hardware solution, almost all of the rest have been solved through hardware/firmware (Intel won’t distinguish which, but to a certain extent it doesn’t matter for new hardware). This is a step in the right direction, but of course it may have a knock-on effect. In addition, any performance gained by moving a fix from firmware to hardware will be rolled into any advertised IPC increase.

Also on the security side is SGX, or Intel’s Software Guard Extensions. Sunny Cove now becomes Intel’s first public processor to enable both AVX-512 and SGX in the same design. Technically the first chip with both SGX and AVX-512 should have been Skylake-X; however, that feature was ultimately disabled due to failing some test validation cases. But it now comes together for Sunny Cove in Ice Lake-U, which is also a consumer processor.

Instruction Improvements and AVX-512

As mentioned, Sunny Cove pulls a number of key improvements from the Cannon Lake design, despite the Cannon Lake chip having the same cache configuration as Skylake. One of the key points here is 64-bit division, which goes from a 97-cycle latency to an 18-cycle latency, blowing past AMD’s 45-cycle latency. For me, an ex-researcher who worked on high-precision math code with no idea about instruction latency or compiler options, this speedup would have been critical.

IDIV -> 97-cycle to 18-cycle
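
To put those latency figures in perspective, here is a back-of-the-envelope sketch. It assumes a chain of divisions in which each one needs the previous result, so latency (not throughput) sets the pace, and it ignores every other instruction in the loop. The workload size and the function name `dependent_chain_cycles` are hypothetical.

```python
def dependent_chain_cycles(n_ops, latency_cycles):
    """Cycles for n_ops back-to-back operations where each waits on the previous result."""
    return n_ops * latency_cycles


N_DIVIDES = 1_000  # hypothetical number of dependent 64-bit divisions

# Latencies in cycles, as quoted in the text above.
LATENCY = {"Skylake": 97, "AMD": 45, "Sunny Cove": 18}

cycles = {core: dependent_chain_cycles(N_DIVIDES, lat) for core, lat in LATENCY.items()}

for core, total in cycles.items():
    ratio = total / cycles["Sunny Cove"]
    print(f"{core:10s} {total:6d} cycles ({ratio:.2f}x Sunny Cove)")
```

Real code rarely consists of nothing but dependent divides, so the gain seen in practice would be smaller than the raw ratio.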

For the general purpose registers, we see a lot of changes, and most of them quite sizable.

Sunny Cove GPR Changes
Instruction	Description	Skylake	Sunny Cove
Complex LEA	Complex Load Effective Address	3 cycle latency, 1 per cycle	1 cycle latency, 2 per cycle
SHL/SHR	Shift Left/Right	2 cycle latency, 0.5 per cycle	1 cycle latency, 1 per cycle
ROL/ROR	Rotate Left/Right	2 cycle latency, 0.5 per cycle	1 cycle latency, 1 per cycle
SHLD/SHRD	Double Precision Shift Left/Right	4 cycle latency, 0.5 per cycle	4 cycle latency, 1 per cycle
4*MOV	Four repeated string MOVS	Limited instructions	104 bits/clock, all MOVS* instructions

In the past we’ve seen x87 instructions being regressed, made slower, as they become obsolete. For whatever reason, Sunny Cove decreases the FMUL latency from 5 cycles to 4 cycles.

The SIMD units also go through some changes:

Sunny Cove SIMD
Instruction	Description	Skylake	Sunny Cove
SIMD Packing	SIMD Packing now slower	1 cycle latency, 1 per cycle	3 cycle latency, 1 per cycle
AES*	AES Crypto Instructions (for 128-bit / 256-bit)	4 cycle latency, 2 per cycle	3 cycle latency, 2 per cycle
CLMUL	Carry-Less Multiplication	7 cycle latency, 1 per cycle	6 cycle latency, 1 per cycle
PHADD/PHSUB	Packed Horizontal Add/Subtract and Saturate	3 cycle latency, 0.5 per cycle	2 cycle latency, 1 per cycle
VPMOV* xmm	Vector Packed Move	2 cycle latency, 0.5 per cycle	2 cycle latency, 1 per cycle
VPMOV* ymm	Vector Packed Move	4 cycle latency, 0.5 per cycle	2 cycle latency, 1 per cycle
VPMOVZX/SX* xmm	Vector Packed Move	1 cycle latency, 1 per cycle	1 cycle latency, 2 per cycle
POPCNT	Microcode 50% faster than SW (under L1-D size)
REP STOS*	Repeated Store String	62 bits/cycle	54 bits/cycle
VPCONFLICT	Still Microcode Only

We’ve already gone through all of the new AVX-512 instructions in our Sunny Cove microarchitecture disclosure. These include the following families:

AVX-512_VNNI (Vector Neural Network Instructions)
AVX-512_VBMI (Vector Byte Manipulation Instructions)
AVX-512_VBMI2 (second level VBMI)
AVX-512_BITALG (bit algorithms)
AVX-512_IFMA (Integer Fused Multiply Add)
AVX-512_VAES (Vector AES)
AVX-512_VPCLMULQDQ (Carry-Less Multiplication of Long Quad Words)
AVX-512+GFNI (Galois Field New Instructions)
SHA (not AVX-512, but still new)
GNA (Gaussian & Neural Accelerator)

(Intel also has the GMM (Gaussian Mixture Model) inside the core since Skylake, but I’ve yet to see any information on this outside a single line in the coding manual.)

For all these new AVX-512 instructions, it’s worth noting that they can be run in 128-bit, 256-bit, or 512-bit mode, depending on the data types passed to them. Each mode can have its own latencies and throughputs, which often get worse in 512-bit mode. Overall, though, assuming you can fill the register with a 512-bit data type, the raw processing will be faster, even with the frequency differential. It should be noted that this doesn’t take into account any additional overhead for entering the 512-bit power state.

Most of these new instructions are relatively fast, with the majority having only 1-3 cycles of latency. We observed the following:

Sunny Cove Vector Instructions
Instruction	Metric	Description	XMM	YMM	ZMM
VNNI	Latency	Vector Neural Network Instructions	5-cycle	5-cycle	5-cycle
VNNI	Throughput	Vector Neural Network Instructions	2/cycle	2/cycle	1/cycle
VPOPCNT*	Latency	Return the number of bits set to 1	3-cycle	3-cycle	3-cycle
VPOPCNT*	Throughput	Return the number of bits set to 1	1/cycle	1/cycle	1/cycle
VPCOMPRESS*	Latency	Store Packed Data	3-cycle	3-cycle	3-cycle
VPCOMPRESS*	Throughput	Store Packed Data	0.5/cycle	0.5/cycle	0.5/cycle
VPEXPAND*	Latency	Load Packed Data	5-cycle	5-cycle	5-cycle
VPEXPAND*	Throughput	Load Packed Data	0.5/cycle	0.5/cycle	0.5/cycle
VPSHLD*	Latency	Vector Shift	1-cycle	1-cycle	1-cycle
VPSHLD*	Throughput	Vector Shift	2/cycle	2/cycle	1/cycle
VAES*	Latency	Vector AES Instructions	3-cycle	3-cycle	3-cycle
VAES*	Throughput	Vector AES Instructions	2/cycle	2/cycle	1/cycle
VPCLMUL	Latency	Vector Carry-Less Multiply	6-cycle	8-cycle	8-cycle
VPCLMUL	Throughput	Vector Carry-Less Multiply	1/cycle	0.5/cycle	0.5/cycle
GFNI	Latency	Galois Field New Instructions	3-cycle	3-cycle	3-cycle
GFNI	Throughput	Galois Field New Instructions	2/cycle	2/cycle	1/cycle
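
One way to read the throughput rows is to convert them into bits processed per cycle, that is, register width multiplied by instructions per cycle. The sketch below does this with the figures from the table above. It is a simplification: it ignores latency, any clock frequency difference between vector widths, and the cost of entering the 512-bit power state. The names `WIDTH_BITS`, `THROUGHPUT`, and `bits_per_cycle` are hypothetical.

```python
# Register widths in bits.
WIDTH_BITS = {"XMM": 128, "YMM": 256, "ZMM": 512}

# Instructions per cycle, copied from the throughput rows of the table above.
THROUGHPUT = {
    "VNNI":        {"XMM": 2.0, "YMM": 2.0, "ZMM": 1.0},
    "VPOPCNT*":    {"XMM": 1.0, "YMM": 1.0, "ZMM": 1.0},
    "VPCOMPRESS*": {"XMM": 0.5, "YMM": 0.5, "ZMM": 0.5},
    "VPEXPAND*":   {"XMM": 0.5, "YMM": 0.5, "ZMM": 0.5},
    "VPSHLD*":     {"XMM": 2.0, "YMM": 2.0, "ZMM": 1.0},
    "VAES*":       {"XMM": 2.0, "YMM": 2.0, "ZMM": 1.0},
    "VPCLMUL":     {"XMM": 1.0, "YMM": 0.5, "ZMM": 0.5},
    "GFNI":        {"XMM": 2.0, "YMM": 2.0, "ZMM": 1.0},
}


def bits_per_cycle(instr, reg):
    """Peak bits processed per cycle: register width times instructions per cycle."""
    return WIDTH_BITS[reg] * THROUGHPUT[instr][reg]


for instr in THROUGHPUT:
    cells = "  ".join(f"{reg}={bits_per_cycle(instr, reg):>4.0f}" for reg in WIDTH_BITS)
    print(f"{instr:12s} {cells}")
```

By this measure, the ZMM form never processes fewer bits per cycle than the YMM form for any row in the table. Where the instruction rate halves at 512-bit, as with VNNI, the two come out equal.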

For all of the common AVX2 instructions, xmm/ymm latencies and throughputs are identical to Skylake; however, zmm is often a few cycles slower for DIV/SQRT variants.

Other Noticeable Observations

From our testing, we were also able to probe some of the other parts of the core, such as the added store ports and shuffle units.

Our data shows that the second store port is not identical to the first, which explains the imbalance when it comes to writes: rather than supporting 2x64-bit with loads, it only supports either 1x64-bit write, or 1x32-bit write, or 2x16-bit writes. This means we mainly see speed ups with GPR/XMM data, and the result is only a small improvement for 512-bit SCATTER instructions. Otherwise, it seems not to work with any 256-bit or 512-bit operand (you can however use it with 64-bit AVX-512 mask registers). This is going to cause a slight headache for anyone currently limited by SCATTER stores.

The new shuffle unit is only 256-bit wide. It will handle a number of integer instructions (UNPCK, PSLLDQ, SHUF*, MOVSHDUP, but not PALIGNR or PACK), but only a couple of floating point instructions (SHUFPD, SHUFPS).

261 Comments

jospoortvliet - Friday, August 2, 2019 - link
Sometimes people have insightful additions or questions. That is never you so I wouldn’t miss your ‘input’.
Phynaz - Friday, August 2, 2019 - link
But yet you replied. Doh!
Korguz - Friday, August 2, 2019 - link
and so did you !!! :-)
Phynaz - Saturday, August 3, 2019 - link
Your comprehension skills aren’t that great, are they. Maybe that’s why you can’t afford a good cpu. Did you finish school?
Korguz - Saturday, August 3, 2019 - link
yep.. but you obliviously havent as only children resort to insults, like you do. and again.. grow up
POlaris1983 - Thursday, August 1, 2019 - link
Thermals and TDP are a test for UNdervolting and OCing on THICC laptops using ai windows OS GUI interface apps for easy one button flipping on and off for these CPUs and GPUs and RAM Timings customizations. Even for desktop towers soon using keyboard functions in special keys like on a laptop once they solve the luqid cooling issues on the THICC laptops.
thetrashcanisfull - Thursday, August 1, 2019 - link
Ian,
In this and the Ryzen 3000 review, I noticed that the 3DPM benchmarks with AVX enabled seem to benefit from AVX-512 much more than I would anticipate.

If I'm understanding things correctly, the AVX-512 parts are capable of 2x512b FMAC / cycle in the case of Skylake-server or 1x512b FMAC + 1x512b ALU / cycle in the case of Sunny Cove, with both handling 2x512b load + 1x512b store / cycle. This would suggest to me that their vector FP performance/cycle ought to be around double that of Skylake-client or Zen 2, both of which do 2x256b FMAC / cycle and 2x256b loads + 1x256b store / cycle. However, in the 3DPM benchmark we see AVX-512 CPUs outpace the performance/cycle of AVX2 CPUs by a factor of 4 - possibly even more than 4, once we account for the frequency penalties associated with AVX-512!

Am I misunderstanding some critical piece of the AVX-512 extension that explains this boost, or is there something wrong with the AVX2 codepath for this benchmark? Only using xmm instructions? Not using FMA instructions?
Mysticial - Friday, August 2, 2019 - link
A while back, Ian sent me the non-vectorized and AVX512-vectorized binaries for 3DPM for me to analyze. (I never looked at the AVX2 version since this was before it was made.)

Based on what I saw, I'm not at all surprised by the result. While I can't say that it fully explains such a large difference between AVX2 and AVX512, there are at least two things I noticed in the AVX512 binary that would contribute towards it.

1. There are 64-bit integer multiplies. AVX512 has the vpmullq instruction. AVX2 does not. Emulating this instruction in AVX2 is *extremely* costly.
2. The ratio of "heavy" to "light" AVX512 instructions is very low. Therefore, the 2nd FMA isn't needed to gain on AVX2.

I've never analyzed the AVX2 binary itself to see how that 64-bit multiply is being handled. It could be vectorized with extreme overhead, not vectorized at all, or worked-around at an algorithmic level.
thetrashcanisfull - Friday, August 2, 2019 - link
ohhhh... That makes more sense. I assumed that the 3DPM benchmark was doing primarily floating point math. I also didn't realize that AVX2 didn't support packed 64b muls... Thanks for the info!
Alexvrb - Friday, August 2, 2019 - link
"The suggested PL2 for Kaby Lake-R was 44W, so this might indicate a small jump in strategy."

Yeah, whereby TDP is virtually meaningless and every machine is a complete mystery box until you buy it and discover what actual thermals/power/performance are like - again regardless of the TDP. This is all without overclocking, mind you.
