DL Boost and New Instructions: Intel’s AI Acceleration Attack

If you like buzzwords, then the field of artificial intelligence is full of them. Unless you are deeply embedded in the use or development of the technologies in this field, it is easy to be overwhelmed by all the acronyms and how they relate to each other. That complexity is multiplied by the new technologies that hardware manufacturers develop to support the methods the experts use, especially as those methods transition from software implementations to hardware-accelerated ones.

Intel’s side of the equation here is two-fold. On one level, the company wants to accelerate how these algorithms are trained – Intel wants to provide the raw horsepower and bandwidth to push millions of samples through these algorithms during training in order to improve their accuracy and efficiency. Training on this scale typically happens in the datacenter, and Intel will happily promote its FPGAs or Neural Network Processors to help do this.

The other side of the discussion is inference – the art of taking a trained algorithm and showing it new data in order to make accurate predictions. The simple example of inference is showing an algorithm that has been trained to identify different objects an image it has never seen before. In contrast to training, inference happens at the device level and is less computationally strenuous, and it can be optimized down to very low-power, hardware-accelerated implementations. There can also be a tradeoff between speed/power and accuracy. Again, the most obvious example of this is a smartphone identifying that the camera is pointed at food, or a sunset, or a building, or a cat.

The two sides of artificial intelligence, training and inference, happen at very different ends of the spectrum. Training occurs at the application developer level, while inference happens very much at the user level. For products that ultimately end up in the hands of end-users, being able to accelerate inference is key to the user experience.

However, Intel has a problem here. In fact, all companies that sell mobile PC products to end-users have a problem. The concepts of inference, such as identifying an item on a camera view screen, are easy to understand in a smartphone form factor. But apply that logic to a laptop, a notebook, or a desktop, and the applicability of how AI can improve the user experience starts to get a little distorted. It is not immediately obvious in what ways AI can improve what a user does on these devices. The smartphone, by contrast, is an easier sell.

As a result, Intel has needed to go out into the community. Its new chip, Ice Lake, is a sledgehammer looking for a nail, no matter how small. The truth of it is that a lot of end-user software does not have artificial intelligence built into it – any ‘intelligence’ that most software has is typically a bunch of ‘if this then that’ statements, which do not fall into this category.

This ‘hammer looking for a nail’ situation was very apparent at Computex 2019. Intel showcased a few demos of its new technology, but most of them were fairly niche. Two examples of this include:

  • Photo Album sorting: searching for all ‘beach’ photos in a photo album that hasn’t been manually tagged.
  • Converting a single 2D image into a 3D model for use in 3D visualization environments.

The first case might be interesting to the wider public – and indeed Apple already does something like this on macOS – however the second use case is currently aimed at business or archival use, which is a specific niche. Ultimately Intel is looking to increase the number of available tools, even niche ones, that can be accelerated on Ice Lake. Intel is taking all suggestions on this.

Intel accelerates these inference models through its ‘DL Boost’ technology, which is essentially a form of vector acceleration.

DL Boost and AVX-512

One of the features of the Ice Lake cores is dedicated silicon for AVX-512 operation, which allows 512-bit wide instructions and math to be executed at once. Ice Lake marks Intel’s first consumer platform to support the feature, if you exclude Cannon Lake, which was not widely available. To date, only Intel’s Skylake-or-newer Xeons and Xeon Phi have had AVX-512 as part of the design. As we saw in our review of Cannon Lake, AVX-512 code can be very effective on these systems: a dual-core system with AVX-512 outscored a 16-core system running 256-bit AVX2.

AVX-512 forms the basis of Intel’s DL Boost libraries, which form part of several SDKs in Intel’s software strategy, such as the OpenVINO toolkit. DL Boost takes advantage of the new core’s ability to recast neural network calculations at lower precision, from FP32 down to FP16, INT8, and even INT2. The balance here is between accuracy, power, and computational complexity: in most non-critical situations, for example, dropping the accuracy of an inference calculation from 94% to 93% would be worth it if it halved latency and power consumption. DL Boost-accelerated libraries also allow for numerical transforms in neural network calculations to further increase throughput and decrease power.
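
For the curious, here is a minimal sketch in C of the kind of low-precision INT8 dot product DL Boost accelerates. The VPDPBUSD instruction at its heart (exposed via AVX-512 VNNI intrinsics) multiplies 64 unsigned-by-signed byte pairs and accumulates them into sixteen 32-bit sums in a single step. The function name and loop are our own illustration, not Intel library code, and assume a compiler with AVX512-VNNI support (e.g. -mavx512vnni) and a length that is a multiple of 64:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative INT8 dot product built on VPDPBUSD. The instruction
       treats 'a' as unsigned bytes and 'b' as signed bytes, and adds
       four byte products into each 32-bit accumulator lane. */
    int32_t dot_int8(const uint8_t *a, const int8_t *b, size_t n)
    {
        __m512i acc = _mm512_setzero_si512();
        for (size_t i = 0; i < n; i += 64) {        /* 64 bytes per iteration */
            __m512i va = _mm512_loadu_si512(a + i);
            __m512i vb = _mm512_loadu_si512(b + i);
            acc = _mm512_dpbusd_epi32(acc, va, vb); /* 64 MACs in one instruction */
        }
        return _mm512_reduce_add_epi32(acc);        /* horizontal sum of 16 lanes */
    }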

Ultimately DL Boost covers the gamut from training to inference – Intel promotes its use on the server Xeon product line for training, whereas the Ice Lake consumer line of products will be focused on inference. Again, its usefulness when it comes to consumer products will initially be limited, given the lack of common AI inference workloads on modern systems. Intel promotes the AIXPRT synthetic benchmark suite as an example of performance uplift, though that focuses almost entirely on image recognition.

New Instructions

Alongside DL Boost, Intel has implemented a series of AVX-512 instructions for other use cases.

AVX-512_VBMI is the first set of Vector Byte Manipulation Instructions, which includes permutes and shifts (a short example in C follows the list):

  • VPERMB: 64-byte any-to-any shuffle, 3 clocks, 1 per clock
  • VPERMI2B: 128-byte any-to-any overwriting indexes, 5 clocks, 1 per 2 clocks
  • VPERMT2B: 128-byte any-to-any overwriting tables, 5 clocks, 1 per 2 clocks
  • VPMULTISHIFTQB: Base64 conversion, 3 clocks, 1 per clock
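
As a concrete illustration, the snippet below uses VPERMB (via its intrinsic) to reverse all 64 bytes of a 512-bit register in a single shuffle, a pattern that needs several instructions under AVX2. The helper function is hypothetical and assumes a compiler flag such as -mavx512vbmi:

    #include <immintrin.h>

    /* Reverse the byte order of a 512-bit register with one VPERMB:
       a 64-byte any-to-any shuffle, driven by the index vector. */
    __m512i reverse_bytes(__m512i v)
    {
        unsigned char idx[64];
        for (int i = 0; i < 64; i++)
            idx[i] = (unsigned char)(63 - i);   /* output byte i <- source byte 63-i */
        return _mm512_permutexvar_epi8(_mm512_loadu_si512(idx), v);
    }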

AVX-512_VBMI2 takes this a stage further, enabling expand and compress functionality (see the sketch after the list):

  • VPCOMPRESSB: Store sparse packed byte integer values into dense memory/register
  • VPCOMPRESSW: Store sparse packed word integer values into dense memory/register
  • VPEXPANDB: Load sparse packed byte integer values from dense memory/register
  • VPEXPANDW: Load sparse packed word integer values from dense memory/register
  • VPSHLD: Concatenate and shift packed data left logical
  • VPSHRD: Concatenate and shift packed data right logical
  • VPSHLDV: Concatenate and variable shift packed data left logical
  • VPSHRDV: Concatenate and variable shift packed data right logical
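
One plausible use of VPCOMPRESSB, sketched below, is left-packing the bytes selected by a mask – here stripping spaces from a 64-byte chunk of text in a couple of instructions. This is our own illustrative example, assuming -mavx512vbmi2 and -mavx512bw, with 64 readable input bytes and 64 writable output bytes:

    #include <immintrin.h>
    #include <stdint.h>

    /* Compact all non-space bytes of a 64-byte chunk to the front of 'out'
       using VPCOMPRESSB; returns how many bytes were kept. */
    uint64_t strip_spaces(const char *in, char *out)
    {
        __m512i v = _mm512_loadu_si512(in);
        __mmask64 keep = _mm512_cmpneq_epi8_mask(v, _mm512_set1_epi8(' '));
        _mm512_storeu_si512(out, _mm512_maskz_compress_epi8(keep, v));
        return (uint64_t)_mm_popcnt_u64(keep);  /* number of mask bits set */
    }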

AVX-512_BITALG enables a number of highly sought-after bit algorithms (example below):

  • VPOPCNTB: Return the number of bits set to 1 in a byte
  • VPOPCNTW: Return the number of bits set to 1 in a word
  • VPSHUFBITQMB: Shuffles bits from quadword elements using byte indexes into mask
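
For instance, VPOPCNTB makes it cheap to tally the set bits of a large bitmap 512 bits at a time. The helper below is a hypothetical sketch assuming -mavx512bitalg and a length that is a multiple of 64 bytes:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Count all set bits in a bitmap using VPOPCNTB (per-byte popcount). */
    uint64_t bitmap_popcount(const uint8_t *bits, size_t len)
    {
        __m512i total = _mm512_setzero_si512();
        for (size_t i = 0; i < len; i += 64) {
            __m512i counts = _mm512_popcnt_epi8(_mm512_loadu_si512(bits + i));
            /* per-byte counts are <= 8; VPSADBW widens them into 64-bit lanes */
            total = _mm512_add_epi64(total,
                                     _mm512_sad_epu8(counts, _mm512_setzero_si512()));
        }
        return _mm512_reduce_add_epi64(total);
    }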

AVX-512_IFMA adds 52-bit integer fused multiply-add (FMA) instructions that behave identically to the AVX-512 floating point FMA, offering a latency of four clocks and a throughput of two per clock (for xmm/ymm; zmm is four and one). These instructions are commonly listed as helping cryptographic functionality, but they also add support for arbitrary precision arithmetic (a sketch follows the list).

  • VPMADD52LUQ: Packed multiply of unsigned 52-bit integers and add the low 52-bit products to qword accumulators
  • VPMADD52HUQ: Packed multiply of unsigned 52-bit integers and add the high 52-bit products to 64-bit accumulators
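
The low/high pair is typically used together, one limb column at a time, in big-number multiply loops of the kind y-cruncher uses. A minimal illustrative step, assuming -mavx512ifma, might look like this:

    #include <immintrin.h>

    /* One column step of a big-integer multiply: eight 52x52-bit products,
       with the low and high halves of each 104-bit result added into
       separate 64-bit accumulators. The 12 spare bits per lane let carries
       be deferred across many iterations. */
    void ifma_step(__m512i *lo, __m512i *hi, __m512i a, __m512i b)
    {
        *lo = _mm512_madd52lo_epu64(*lo, a, b);  /* lo += (a*b) bits [51:0]   */
        *hi = _mm512_madd52hi_epu64(*hi, a, b);  /* hi += (a*b) bits [103:52] */
    }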

Alexander Yee, the developer of the hyper-optimized mathematical constant calculator y-cruncher, explained to me why IFMA helps his code when calculating constants like Pi:

The standard double-precision floating-point hardware in Intel CPUs has a very powerful multiplier that has been there since antiquity. But it couldn't be effectively tapped into because that multiplier was buried inside the floating-point unit. The SIMD integer multiply instructions only let you utilize up to 32x32 out of the 52x52 size of the double-precision multiply hardware with additional overhead needed. This inefficiency didn't go unnoticed, so people ranted about it, hence why we now have IFMA.

The main focus of research papers on this is that big number arithmetic that wants the largest integer multiplier possible. On x64 the largest multiplier was the 64 x 64 -> 128-bit scalar multiply instruction. This gives you (64*64 = 4096 bits) of work per cycle. With AVX512, the best you can do is eight 32 x 32 -> 64-bit multiply via the VPMULDQ instruction, which gets you (8 SIMD lanes * 32*32 * 2FMA = 16384 bits) of work per cycle. But in practice, it ends up being about half of that because you have the overhead of additions, shifts, and shuffles competing for the same execution ports.

With AVX512-IFMA, users can unleash the full power of the double-precision hardware. A low/high IFMA pair will get you (8 SIMD lanes * 52*52 = 21632 bits) of work. That's 21632/cycle with 2 FMAs or 10816/cycle with 1 FMA. But the fused addition and 12 "spare bits" allows the user to eliminate nearly all the overhead that is needed for the AVX512-only approach. Thus it is possible to achieve nearly the full 21632/cycle of efficiency with the right port configuration (CNL/ICL only has 1 FMA).

There's more to the IFMA arbitrary precision arithmetic than just the largest multiplier possible. RSA encryption is probably one of the only applications that will get the full benefit of the IFMA as described above. y-cruncher benefits partially. Prime95 will not benefit at all.

For the algorithms that can take advantage of it, this boils down to the following table:

IFMA Performance
                     Scalar x64       AVX512-F          AVX512-IFMA
  Single 512b FMA    4096-bit/cycle   ~4000-bit/cycle   10816-bit/cycle
  Dual 512b FMA      4096-bit/cycle   ~8000-bit/cycle   21632-bit/cycle

AVX-512_VAES adds vector extensions of the AES instructions. Current cores can process one 128-bit AES block per cycle, whereas the VAES instructions operate on four 128-bit blocks per cycle (see the sketch after the list):

  • VAESDEC: Perform one round of an AES decryption flow
  • VAESENC: Perform one round of an AES encryption flow
  • VAESDECLAST: Perform last round of an AES decryption flow
  • VAESENCLAST: Perform last round of an AES encryption flow
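
The sketch below shows the idea: one AES round applied to four independent 128-bit blocks held in a single 512-bit register, as in four parallel lanes of AES-CTR. A real cipher runs 10 to 14 rounds with per-round keys; this hypothetical helper shows a single round only and assumes -mvaes -mavx512f:

    #include <immintrin.h>

    /* One AES encryption round on four 128-bit blocks at once via VAESENC. */
    __m512i aes_round_x4(__m512i blocks, __m128i round_key)
    {
        /* broadcast the same round key into all four 128-bit lanes */
        __m512i rk = _mm512_broadcast_i32x4(round_key);
        return _mm512_aesenc_epi128(blocks, rk);
    }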

AVX-512_VPCLMULQDQ is a single instruction, also added to Ice Lake, that enables carry-less multiplication of long quadword values (example below):

  • VPCLMULQDQ: Carry-less multiplication quadword
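
Carry-less multiplication is the building block of GHASH (AES-GCM) and fast CRCs. The wrapper below, an illustrative example assuming -mvpclmulqdq -mavx512f, performs four 64 x 64 -> 128-bit carry-less multiplies at once; the immediate selects which 64-bit half of each 128-bit lane is multiplied:

    #include <immintrin.h>

    /* Four 64x64 -> 128-bit carry-less multiplies in one VPCLMULQDQ.
       Immediate 0x00 multiplies the low 64-bit halves of each lane. */
    __m512i clmul_lo_x4(__m512i a, __m512i b)
    {
        return _mm512_clmulepi64_epi128(a, b, 0x00);
    }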

AVX-512+GFNI adds ‘Galois Field’ instructions (example below):

  • GF2P8AFFINEINVQB: Galois field affine transformation inverse
  • GF2P8AFFINEQB: Galois field affine transformation
  • GF2P8MULB: Galois field multiply bytes
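
GF2P8MULB, for example, performs 64 parallel multiplies in the AES field GF(2^8), the arithmetic behind AES’s MixColumns step and many erasure codes. A trivial illustrative wrapper, assuming -mgfni -mavx512f:

    #include <immintrin.h>

    /* 64 byte-wise multiplies in GF(2^8), reduced by the AES polynomial
       x^8 + x^4 + x^3 + x + 1, in a single GF2P8MULB. */
    __m512i gf_mul_bytes(__m512i a, __m512i b)
    {
        return _mm512_gf2p8mul_epi8(a, b);
    }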

Beyond the AVX-512 instructions, there are a number of new instructions in the core, particularly pertaining to cryptography. Ice Lake-U will support the SHA instructions, used for accelerating the SHA-1 and SHA-256 algorithms (a sketch follows the list):

  • SHA1RNDS4
  • SHA1NEXTE
  • SHA1MSG1
  • SHA1MSG2
  • SHA256RNDS2
  • SHA256MSG1
  • SHA256MSG2
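
To give a flavor of how these are used, the fragment below performs four SHA-256 rounds: SHA256RNDS2 executes two rounds per invocation on state split across two XMM registers. A complete implementation iterates this, plus the SHA256MSG1/MSG2 schedule helpers, over each 64-byte block; this hypothetical fragment assumes -msha and round constants already added to the schedule words:

    #include <immintrin.h>

    /* Four SHA-256 rounds via two SHA256RNDS2 invocations. 'msg_plus_k'
       holds the message-schedule words W[i..i+3] with the round constants
       K[i..i+3] already added. */
    void sha256_four_rounds(__m128i *state0, __m128i *state1, __m128i msg_plus_k)
    {
        /* rounds i and i+1 consume the low two words of the schedule */
        *state1 = _mm_sha256rnds2_epu32(*state1, *state0, msg_plus_k);
        /* rounds i+2 and i+3 consume the upper two words */
        msg_plus_k = _mm_shuffle_epi32(msg_plus_k, 0x0E);
        *state0 = _mm_sha256rnds2_epu32(*state0, *state1, msg_plus_k);
    }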

The new Ice Lake core also has a Gaussian Neural Accelerator (GNA), a hardware block that enables low-power wake-on-voice. Intel also has the Gaussian Mixture Model (GMM) inside the core, as it has since Skylake, however Intel still has not provided much (if any) information on how it works. If you’re lucky, you might find a CPU manual that mentions it exists, but beyond that, little information is available. It stands to reason that the GNA will get the same treatment. Our best guess is that these units assist Microsoft Cortana with low-power wake-on-voice inference algorithms; however, they don’t seem to be open to other software developers, or perhaps they are, but only under an NDA with Intel.

Comments

  • name99 - Wednesday, July 31, 2019 - link

    That’s an idiotic chain of reasoning.
    ARM Macs will ship with macOS, not iOS. To believe otherwise only reveals that you know absolutely nothing of how Apple thinks.

    As for comparison, the rough number is A12X gets ~5200 on GB4, Intel best (non-OC’d) gets ~5800. That’s collapsing lots of numbers down to one, but comparing benchmark by benchmark you see Apple does very well (almost matching Intel) across an awful lot.

    If Apple can maintain its past pace (and there is no reason why not...) we can expect A13X to be anywhere from 20% to 35% faster, which puts it well into “fastest [non-OC’d] CPU on earth” territory for most single-threaded use cases. Can they achieve this? Absolutely.
    Just process improvement can get them 10% frequency. I expect A13X to clock around 2.8GHz.
    Then there is LPDDR5 which I expect they will be using, so substantially improved memory bandwidth. Then I expect they'll have SVE (2x256) and accompanying that basically double the bandwidth all the way out from L1 to DRAM.
    These are just the obvious basics. There are a bunch of things they can still do that represent “fairly easy” improvements to get to that 25% or so. (These include more aggressive fusion, a double-pumped ALU, attached ALUs to load/store to allow load-ok and op-store fusion, a micro-op cache, long-term-parking, criticality prediction, ...)

    So, if it’s so easy, why doesn’t Intel also do it? Why indeed? That’s why I occasionally post my alternative rant about how INTC is no longer an engineering company, it is now pretty much purely a finance company...
  • ifThenError - Friday, August 2, 2019 - link

    Sorry, but both these comments seem mighty uninformed. The MacBooks Air and Pro currently and in the foreseeable future all run on Intel CPUs. The Apple chips A12/13 are used in the iPhone, iPad and the like.

    And regarding your prediction, your enthusiasm seems way over the top. What are you even talking about? Micro-op cache on a RISC processor? Think again. Aren't RISC commands all micro ops already?
  • name99 - Sunday, August 4, 2019 - link

    Strong the Dunning-Kruger is with this one...
    Dude, seriously, learn something about MODERN CPU design, more than just buzz-words from the 80s.
    To get you started, how about you read
    https://www.anandtech.com/show/14384/arm-announces...
    and concentrate on understanding EVERY aspect of what's being added to the CPU and why.
    Note in particular that 1.5K Mop cache...

    More questions to ask yourself:
    - Why was 80s RISC obsessed with REDUCED instructions?
    - Why was ARM (especially ARMv8) NOT obsessed with that? Look at the difference between ARMv8 and, say, RISC-V.
    - Why is op-fusion so important a part of modern high performance CPUs (both x86 and ARM [and presumably RISC-V if they EVER ship a high-performance part, ha...])?
    - which are the fast (shallow logic, even if it's wide) and which are the slow (deep logic) parts of a MODERN pipeline?
  • ifThenError - Monday, August 5, 2019 - link

    Oh my, this is so entertaining you should charge for the reading.

    You demand to go beyond just buzz words (which would be good) while your posts look like entries to a contest on how many marketing phrases can fit into a paragraph.
    Then you even manage to combine this with highly rude idiom. Plus you name a psychological effect but fail to transfer it to self-reflection. And as the cherry on top you obviously claim for yourself to understand „EVERY aspect“ of a CPU (an unimaginably complex bit of engineering) but even manage to confuse micro- and macro-op caches and the conceptual differences between them.

    I'm really impressed by your courage. Publicly posting so boldly on such a thin basis is brave.
    Your comments add near zero information but are definitely worth the read. Pure comedy gold!

    Please see this as an invitation to reply. I'm looking forward to some more of your attempts to insult.
  • Techgeek43 - Tuesday, July 30, 2019 - link

    Fantastic article Ian, I for one, cannot wait for ice lake laptops
    Wonderful in-depth analysis, with an interesting insight into the Intel brand
  • repoman27 - Tuesday, July 30, 2019 - link

    "The high-end design with 64 execution units will be called Iris Plus, but there will be a ‘UHD’ version for mid-range and low-end parts, however Intel has not stated how many execution units these parts will have."

    Ah, but they have: Ice Lake-U Iris Plus (48EU, 64EU) 15 W, Ice Lake-U UHD (32EU) 15 W. So their performance comparisons may even be to the 15 W Iris Plus with 64 EUs, rather than the full fat 28 W version.

    I know you have access to the media slide decks, but Intel has also posted product briefs for the general public that contain a lot of this info: https://www.intel.com/content/www/us/en/products/d...

    "On display pipes, Gen11 has access to three 4K pipes split between DP1.4 HBR3 and HDMI 2.0b. There is also support for 2x 5K60 or 1x 4K120 with a 10-bit color depth."

    The three display pipes are not limited to 4K, and are agnostic of transport protocol—each of them can be output via the eDP 1.4b port, one of the 3 DDI interfaces which can support either DisplayPort 1.4 or HDMI 2.0b, or one of the up to 4 Thunderbolt 3 ports. Both HDMI and DP support HDCP 2.2, and DisplayPort also supports DSC 1.1. The maximum single pipe, single port resolution for HDMI is 4K60 10bpc (4:2:2), and for DisplayPort it's 4K120/5K60 10bpc (with DSC).

    Thunderbolt 3 integration for Ice Lake-Y is only up to 3 ports.
  • abufrejoval - Tuesday, July 30, 2019 - link

    What I personally liked most about the GT3e (48 EU) and GT4e (72 EU) Skylake variant SoCs was that they didn't cost the extra money they should have, especially when you consider that the iGPU part completely dwarfs the CPU cores (which Intel makes you bleed for) and is much bigger than everything else combined (have a look at the WikiChips layouts:
    https://en.wikichip.org/wiki/intel/microarchitectu...)

    Of course, a significantly better graphics performance is never a bad thing, especially when it also doesn't cost extra electrical power: The bigger iGPUs might have actually been more energy efficient than their GT2 brethren at a graphics load that pushed the GT2 towards its frequency limits. And in any case if you don't crunch it on graphics, the idle consumption is near perfect: One of the reasons most laptop dGPU designs won't even bother to run 2D on the dGPU any more but leave that to Intel.

    The biggest downside was that you couldn't buy them outside an Apple laptop or Intel NUC.

    But however much Intel goes into Apple mode (Apple being the major customer for these beefier iGPUs) in terms of "x times faster than previous", the results aren't going to turn ultrabooks with this configuration into "THD gaming machines".

    To have a good feel as to where these could go and whether they are worth the wait, just have a look at the Skull Canyon nuc6i7kyk review on this site: that SoC uses 72 EUs and 128MB of eDRAM and should put a pretty firm upper limit on what a 64 EU Ice Lake can do. Most of the games in that review are somewhat dated, yet fail to reach 20FPS at THD.

    So if you want to game on the device, you'd be much better off with a dGPU, however small, and choosing the smallest iGPU variant available. No reason to wait, Whisky + Nvidia will do better.

    If you want real gaming performance, you need to put real triple-digit Watts and the bandwidth only GDDR5/6 or HBM can deliver to work, even at THD, but with remote gaming perhaps it doesn't have to be on your elegant slim ultrabook. There again anything but the GT2 configuration is wasted, because you only need the VPU part for decoding Google Stadia (or Steam Remote) streams, which is the same for all configurations.

    For some strange reason, Intel has been selling GT3/4 NUCs at little or no premium over GT2 variants, and in that case I have been seriously tempted. And once I even managed to find a GT3e laptop for a GT2 price (while the SoC is literally twice as big and the die carrier even adds eDRAM at zero markup), which I still cherish.

    But if prices are anywhere related to the surface area of the chip (as they are for the server parts), these high powered GTs are something that only Apple users would buy.

    That's another reason I (sadly) don't expect them to be sold in anything but Macs and some NUCs, no ChuWi notebooks or Mini-ITX boards.
  • abufrejoval - Tuesday, July 30, 2019 - link

    ...(need edit)

    Judging from the first 10nm generation, GPUs were the part where obtaining economically feasible yields didn't work out. Unless they have really, really fixed 10nm, it's not hard to imagine that Intel could be selling high-EU-count SoCs to Apple below cost, to keep them for another generation as a flagship customer and perhaps due to long-term contractual obligations.

    But maintaining GT2/3/4 price parity for the rest of the market seems suicidal even if you have a fab lead.

    Not that I expect we'll ever be told: in near-monopoly situations the so-called market economy becomes surprisingly complex.
  • willis936 - Wednesday, July 31, 2019 - link

    What the hell is a THD in this context?
  • jospoortvliet - Monday, August 5, 2019 - link

    Probably full HD (True HD)?
