New Instructions and Updated Security

When a new generation of processors is launched, alongside the physical design and layout changes made, this is usually the opportunity to also optimize instruction flow, increase throughput, and enhance security.

Core Instructions

When Intel first stated to us in our briefings that by-and-large, aside from the caches, the new core was identical to the previous generation, we were somewhat confused. Normally we see something like a common math function get sped up in the ALUs, but no – the only additional changes made were for security.

As part of our normal benchmark tests, we do a full instruction sweep, covering throughput and latency for all (known) supported instructions inside each of the major x86 extensions. We did find some minor enhancements within Willow Cove.

  • CLD/STD - Clearing and setting the data direction flag - Latency is reduced from 5 to 4 clocks
  • REP STOS* - Repeated String Stores - Increased throughput from 53 to 62 bytes per clock
  • CMPXCHG16B - compare and exchange bytes - latency reduced from 17 clocks to 16 clocks
  • LFENCE - serializes load instructions - throughput up from 5/cycle to 8/cycle

There were two regressions:

  • REP MOVS* - Repeated Data String Moves - Decreased throughput from 101 to 93 bytes per clock
  • SHA256MSG1 - SHA256 message scheduling - throughput down from 5/cycle to 4/cycle

It is worth noting that Willow Cove, while supporting SHA instructions, does not have any form of hardware-based SHA acceleration. By comparison, Intel’s lower-power Tremont Atom core does have SHA acceleration, as does AMD’s Zen 2 cores, and even VIA’s cores and VIA’s Zhaoxin joint venture cores. I’ve asked Intel exactly why the Cove cores don’t have hardware-based SHA acceleration (either due to current performance being sufficient, or timing, or power, or die area), but have yet to receive an answer.

From a pure x86 instruction performance standpoint, Intel is correct in that there aren’t many changes here. By comparison, the jump from Skylake to Cannon Lake was bigger than this.

Security and CET

On the security side, Willow Cove will now enable Control-Flow Enforcement Technology (CET) to protect against a new type of attack. In this attack, the methodology takes advantage of control transfer instructions, such as returns, calls and jumps, to divert the instruction stream to undesired code.

CET is the combination of two technologies: Shadow Stacks (SS) and Indirect Branch Tracking (IBT).

For returns, the Shadow Stack creates a second stack elsewhere in memory, through the use of a shadow stack pointer register, with a list of return addresses with page tracking - if the return address on the stack is called and not matched with the return address expected in the shadow stack, the attack will be caught. Shadow stacks are implemented without code changes, however additional management in the event of an attack will need to be programmed for.

New instructions are added for shadow stack page management:

  • INCSSP: increment shadow stack pointer (i.e. to unwind shadow stack)
  • RDSSP: read shadow stack pointer into general purpose register
  • SAVEPREVSSP/RSTORSSP: save/restore shadow stack (i.e. thread switching)
  • WRSS: Write to Shadow Stack
  • WRUSS: Write to User Shadow Stack
  • SETSSBSY: Set Shadow Stack Busy Flag to 1
  • CLRSSBSY: Clear Shadow Stack Busy Flag to 0

Indirect Branch Tracking is added to defend against equivalent misdirected jump/call targets, but requires software to be built with new instructions:

  • ENDBR32/ENDBR64: Terminate an indirect branch in 32-bit/64-bit mode

Full details about Intel’s CET can be found in Intel’s CET Specification.

At the time of presentation, we were under the impression that CET would be available for all of Intel’s processors. However we have since learned that Intel’s CET will require a vPro enabled processor as well as operating system support for Hardware-Enforced Stack Protection. This is currently available on Windows 10’s Insider Previews. I am unsure about Linux support at this time.

Update: Intel has reached out to say that their text implying that CET was vPro only was badly worded. What it was meant to say was 'All CPUs support CET, however vPro also provides additional security such as Intel Hardware Shield'.

 

AI Acceleration: AVX-512, Xe-LP, and GNA2.0

One of the big changes for Ice Lake last time around was the inclusion of an AVX-512 on every core, which enabled vector acceleration for a variety of code paths. Tiger Lake retains Intel’s AVX-512 instruction unit, with support for the VNNI instructions introduced with Ice Lake.

It is easy to argue that since AVX-512 has been around for a number of years, particularly in the server space, we haven’t yet seen it propagate into the consumer ecosphere in any large way – most efforts for AVX-512 have been primarily by software companies in close collaboration with Intel, taking advantage of Intel’s own vector gurus and ninja programmers. Out of the 19-20 or so software tools that Intel likes to promote as being AI accelerated, only a handful focus on the AVX-512 unit, and some of those tools are within the same software title (e.g. Adobe CC).

There has been a famous ruckus recently with the Linux creator Linus Torvalds suggesting that ‘AVX-512 should die a painful death’, citing that AVX-512, due to the compute density it provides, reduces the frequency of the core as well as removes die area and power budget from the rest of the processor that could be spent on better things. Intel stands by its decision to migrate AVX-512 across to its mobile processors, stating that its key customers are accustomed to seeing instructions supported across its processor portfolio from Server to Mobile. Intel implied that AVX-512 has been a win in its HPC business, but it will take time for the consumer platform to leverage the benefits. Some of the biggest uses so far for consumer AVX-512 acceleration have been for specific functions in Adobe Creative Cloud, or AI image upscaling with Topaz.

Intel has enabled new AI instruction functionality in Tiger Lake, such as DP4a, which is an Xe-LP addition. Tiger Lake also sports an updated Gaussian Neural Accelerator 2.0, which Intel states can offer 1 Giga-OP of inference within one milliwatt of power – up to 38 Giga-Ops at 38 mW. The GNA is mostly used for natural language processing, or wake words. In order to enable AI acceleration through the AVX-512 units, the Xe-LP graphics, and the GNA, Tiger Lake supports Intel’s latest DL Boost package and the upcoming OneAPI toolkit.

10nm SuperFin, Willow Cove, Xe, and new SoC Cache Architecture: The Effect of Increasing L2 and L3
Comments Locked

253 Comments

View All Comments

  • blppt - Saturday, September 26, 2020 - link

    Sure, the box sitting right next to my desk doesn't exist. Nor the 10 or so AMD cards I've bought over the past 20 years.

    1 5970
    2 7970s (for CFX)
    1 Sapphire 290x (BF4 edition, ridiculously loud under load)
    2 XFX 290 (much better cooler than the BF4 290x) mistakenly bought when I thought it would accept a flash to 290x, got the wrong builds, for CFX)
    2 290x 8gb sapphire custom edition (for CFX, much, much quieter than the 290x)
    1 Vega 64 watercooled (actually turned out to be useful for a Hackintosh build)
    1 5700xt stock edition

    Yeah, i just made this stuff up off the top of my head. I guarantee I've had more experience with AMD videocards than the average gamer. Remember the separate CFX CAP profiles? I sure do.

    So please, tell me again how I'm only a Nvidia owner.
  • Santoval - Sunday, September 20, 2020 - link

    If the top-end Big Navi is going to be 30-40% faster than the 2080 Ti then the 3080 (and later on the 3080 Ti, which will fit between the 3080 and the 3090) will be *way* beyond it in performance, in a continuation of the status quo of the last several graphics card generations. In fact it will be even worse this generation, since Big Navi needs to be 52% faster than the 2080 Ti to even match the 3070 in FP32 performance.

    Sure, it might have double the memory of the 3070, but how much will that matter if it's going to be 15 - 20% slower than a supposed "lower grade" Nvidia card? In other words "30-40% faster than the 2080 Ti" is not enough to compete with Ampere.

    By the way, we have no idea how well Big Navi and the rest of the RDNA2 cards will perform in ray-tracing, but I am not sure how that matters to most people. *If* the top-end Big Navi has 16 GB of RAM, it costs just as much as the 3070 and is slightly (up to 5-10%) slower than it in FP32 performance but handily outperforms it in ray-tracing performance then it might be an attractive buy. But I doubt any margins will be left for AMD if they sell a 16 GB card for $500.

    If it is 15-20% slower and costs $100 more noone but those who absolutely want 16 GB of graphics RAM will buy it; and if the top-end card only has 12 GB of RAM there goes the large memory incentive as well..
  • Spunjji - Sunday, September 20, 2020 - link

    @Santoval, why are you speaking as if the 3080's performance characteristics are not already known? We have the benchmarks in now.

    More importantly, why are you making the assumption that AMD need to beat Nvidia's theoretical FP32 performance when it was always obvious (and now extremely clear) that it has very little bearing on the product's actual performance in games?

    The rest of your speculation is knocked out of what by that. The likelihood of an 80CU RDNA 2 card underperforming the 3070 is nil. The likelihood of it underperforming the 3080 (which performs like twice a 5700, non-XT) is also low.
  • Byte - Monday, September 21, 2020 - link

    Nvidia probably has a good idea how it performs with access to PS5/Xbox, they know they had to be aggressive this round with clock speeds and pricing. As we can see 3080 is almost maxed, o/c headroom like that of AMD chips, and price is reasonable decent, in line with 1080 launch prices before minepocalypse.
  • TimSyd - Saturday, September 19, 2020 - link

    Ahh don't ya just love the fresh smell of TROLL
  • evernessince - Sunday, September 20, 2020 - link

    The 5700XT is RDNA1 and it's 1/3rd the size of the 2080 Ti. 1/3rd the size and only 30% less performance. Now imagine a GPU twice the size of the 5700XT, thus having twice the performance. Now add in the node shrink and new architecture.

    I wouldn't be surprised if the 6700XT beat the 2080 Ti, let alone AMD's bigger Navi 2 GPUs.
  • Cooe - Friday, December 25, 2020 - link

    Hahahaha. "Only matching a 2080 Ti". How's it feel to be an idiot?
  • tipoo - Friday, September 18, 2020 - link

    I'd again ask you why a laptop SoC would have an answer for a big GPU. That's not what this product is.
  • dotjaz - Friday, September 18, 2020 - link

    "This Intel Tiger" doesn't need an answer for Big Navi, no laptop chip needs one at all. Big Navi is 300W+, no way it's going in a laptop.

    RDNA2+ will trickle down to mobile APU eventually, but we don't know if Van Gogh can beat TGL yet, I'm betting not because it's likely a 7-15W part with weaker Quadcore Zen2.

    Proper RDNA2+ APU won't be out until 2022/Zen4. By then Intel will have the next gen Xe.
  • Santoval - Sunday, September 20, 2020 - link

    Intel's next gen Xe (in Alder Lake) is going to be a minor upgrade to the original Xe. Not a redesign, just an optimization to target higher clocks. The optimization will largely (or only) happen at the node level, since it will be fabbed with second gen SuperFin (formerly 10nm+++), which is supposed to be (assuming no further 7nm delays) Intel's last 10nm node variant.
    How well will that work, and thus how well 2nd gen Xe will perform, will depend on how high Intel's 2nd gen SuperFin will clock. At best 150 - 200 MHz higher clocks can probably be expected.

Log in

Don't have an account? Sign up now